library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Background and Overview

DataCamp offers interactive courses related to R programming. While some of the material is review, it is helpful to see other perspectives on it. DataCamp also has some interesting material on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:

This document is currently split across _v003, _v003_a, _v003_b, and _v003_c due to the need to keep the number of DLLs that it opens below the hard-coded maximum. This introductory section needs to be rewritten, and the contents consolidated, at a future date.

The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents have been split to create this document.


Hierarchical and Mixed Effects Models

Chapter 1 - Overview and Introduction

What is a hierarchical model?

  • Hierarchical data is nested within itself, and can be analyzed using the lme4 package
    • Example of students in a classroom - may not all be independent of each other due to teacher quality, building conditions, etc.
    • Hierarchical models can help with pooling means across small sample sizes
    • Repeated measurements (test scores each year) are also a common example of data that are not truly independent
  • Hierarchical models can include nested models and multi-level models
  • Regression frameworks can include pooled information, random effects (vs. fixed effects), mixed effects, and linear mixed effects
  • Repeated sampling can be handled with repeated-measures modeling

Parts of a regression:

  • Linear regression and linear model can be used interchangeably for this course - epsilon is the error term, assumed to be normal with zero mean and constant variance
  • The linear model in R is closely related to analysis of variance (ANOVA)
    • lm (y ~ x, myData)
    • anova( lm (y ~ x, myData) )
  • The most basic regression has an intercept, a slope, a single predictor, and an error term
    • The concept can be extended to multiple regression with additional predictors
  • There are some limitations to the multiple regression approach
    • Parameter estimates can be very sensitive to other variables - Simpson’s paradox and the like
    • Need to note that the regression coefficient is “after controlling for . . .” (all the other variables)
    • Interaction terms can be important as well
  • In R, a regression that fits a separate intercept for every group (with no global intercept) is specified as lm(y ~ x - 1)
  • The interaction formula x1*x2 is shorthand for x1 + x2 + x1:x2
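The formula shorthands above can be checked directly on simulated data (a minimal sketch; demoData and its columns are made up for illustration):

# Verify that x1*x2 expands to x1 + x2 + x1:x2, and that "- 1" removes the
# global intercept so each factor level gets its own intercept
set.seed(1)
demoData <- data.frame(x1 = rnorm(20), x2 = rnorm(20),
                       g = factor(rep(c("a", "b"), 10)))
demoData$y <- rnorm(20)
all.equal(coef(lm(y ~ x1 * x2, demoData)),
          coef(lm(y ~ x1 + x2 + x1:x2, demoData)))   # same coefficients
coef(lm(y ~ g - 1, demoData))   # one intercept per group: ga, gb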

Random effects in regression:

  • Nested relationships tend to be hierarchical in nature - students are part of classes are part of schools and the like
    • Mathematically, this is referred to as a mapping among the distributions
  • The algebraic representation is y ~ B*x + eps, with B ~ N(mu, sigma^2)
    • library(lme4) is the best package for this in R
    • lme4::lmer(y ~ x + (1|randomGroup), data=myData)
    • lme4::lmer(y ~ x + (randomSlope|randomGroup), data=myData)
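The lmer calls above can be tried end-to-end on simulated data (a minimal sketch; the data and variable names are made up to match the bullets):

# Simulate a random intercept per group, drawn from N(0, sigma^2),
# then recover it with a random-intercept model
set.seed(42)
simData <- data.frame(randomGroup = factor(rep(1:10, each = 20)),
                      x = rnorm(200))
groupShift <- rnorm(10, mean = 0, sd = 2)   # the random effect per group
simData$y <- 3 + 1.5 * simData$x + groupShift[simData$randomGroup] + rnorm(200)
summary(lme4::lmer(y ~ x + (1 | randomGroup), data = simData))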

School data:

  • Application of multi-level models to school data - influence of sex, teacher training, plotting parameter estimates

Example code includes:

rawStudent <- read.csv("./RInputFiles/classroom.csv")

studentData <- rawStudent %>%
    mutate(sex=factor(sex, labels=c("male", "female")), minority=factor(minority, labels=c("no", "yes")))


# Plot the data
ggplot(data = studentData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Fit a linear model
summary( lm(mathgain ~ housepov , data = studentData))
## 
## Call:
## lm(formula = mathgain ~ housepov, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.226  -22.222   -1.306   19.763  195.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.937      1.674   34.02   <2e-16 ***
## housepov       3.531      7.515    0.47    0.639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.63 on 1188 degrees of freedom
## Multiple R-squared:  0.0001858,  Adjusted R-squared:  -0.0006558 
## F-statistic: 0.2208 on 1 and 1188 DF,  p-value: 0.6385
# I have aggregated the data for you into two new datasets at the classroom- and school-levels (As a side note, if you want to learn how to aggregate data, the dplyr or data.table courses teach these skills)
# We will also compare the model outputs across all three levels of aggregation
# Note: how we aggregate the data is important
# I aggregated the data by taking the mean across the student data (in pseudo-code: mean(mathgain) by school or mean(mathgain) by classroom), 
# but another reasonable method for aggregating the data would be to aggregate by classroom first and school second

classData <- studentData %>%
    group_by(schoolid, classid) %>%
    summarize_at(vars(mathgain, mathprep, housepov, yearstea), mean, na.rm=TRUE)
str(classData)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  312 obs. of  6 variables:
##  $ schoolid: int  1 1 2 2 2 3 3 3 3 4 ...
##  $ classid : int  160 217 197 211 307 11 137 145 228 48 ...
##  $ mathgain: num  65.7 57.4 49.5 69 68.8 ...
##  $ mathprep: num  2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
##  $ yearstea: num  1 2 1 2 12.5 ...
##  - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':   107 obs. of  2 variables:
##   ..$ schoolid: int  1 2 3 4 5 6 7 8 9 10 ...
##   ..$ .rows   :List of 107
##   .. .. [list output truncated]
##   ..- attr(*, ".drop")= logi TRUE
schoolData <- studentData %>%
    group_by(schoolid) %>%
    summarize_at(vars(mathgain, mathprep, housepov, yearstea), mean, na.rm=TRUE)
str(schoolData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    107 obs. of  5 variables:
##  $ schoolid: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mathgain: num  59.6 65 88.9 35.2 60.2 ...
##  $ mathprep: num  2.91 2.35 2.95 2 3.75 ...
##  $ housepov: num  0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
##  $ yearstea: num  1.73 6.02 14.93 22 3 ...
# First, plot housepov and mathgain at the classroom-level from the classData data.frame
ggplot(data = classData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Second, plot housepov and mathgain at the school-level from the schoolData data.frame
ggplot(data = schoolData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Third, compare your linear regression results from the previous exercise to the two new models
summary( lm(mathgain ~ housepov, data = studentData)) ## student-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.226  -22.222   -1.306   19.763  195.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.937      1.674   34.02   <2e-16 ***
## housepov       3.531      7.515    0.47    0.639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.63 on 1188 degrees of freedom
## Multiple R-squared:  0.0001858,  Adjusted R-squared:  -0.0006558 
## F-statistic: 0.2208 on 1 and 1188 DF,  p-value: 0.6385
summary( lm(mathgain ~ housepov, data = classData)) ## class-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = classData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.479 -14.444  -1.447  13.151 156.516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   58.160      2.542  22.879   <2e-16 ***
## housepov      -3.236     10.835  -0.299    0.765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.14 on 310 degrees of freedom
## Multiple R-squared:  0.0002876,  Adjusted R-squared:  -0.002937 
## F-statistic: 0.08918 on 1 and 310 DF,  p-value: 0.7654
summary( lm(mathgain ~ housepov, data = schoolData)) ## school-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = schoolData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.660  -9.947  -2.494   9.546  41.445 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   59.338      2.624  22.616   <2e-16 ***
## housepov     -11.948     10.987  -1.087    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.8 on 105 degrees of freedom
## Multiple R-squared:  0.01114,    Adjusted R-squared:  0.00172 
## F-statistic: 1.183 on 1 and 105 DF,  p-value: 0.2793
# Plot the means of your data, predictor is your x-variable, response is your y-variable, and intDemo is your data.frame
intDemo <- data.frame(predictor=factor(c(rep("a", 5), rep("b", 5), rep("c", 5))), 
                      response=c(-1.207, 0.277, 1.084, -2.346, 0.429, 5.759, 4.138, 4.18, 4.153, 3.665, 9.046, 8.003, 8.447, 10.129, 11.919)
                      )
str(intDemo)
## 'data.frame':    15 obs. of  2 variables:
##  $ predictor: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 2 2 2 2 2 ...
##  $ response : num  -1.207 0.277 1.084 -2.346 0.429 ...
ggIntDemo <- ggplot(intDemo, aes(x = predictor, y = response) ) +
    geom_point() +
    theme_minimal() + stat_summary(fun.y = "mean", color = "red",
                                   size = 3, geom = "point") +
    xlab("Intercept groups")
print(ggIntDemo)

# Fit a linear model to your data where response is "predicted by"(~) predictor
intModel <- lm( response ~ predictor - 1 , data = intDemo)
summary(intModel)
## 
## Call:
## lm(formula = response ~ predictor - 1, data = intDemo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9934 -0.7842 -0.2260  0.7056  2.4102 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## predictora  -0.3526     0.5794  -0.609    0.554    
## predictorb   4.3790     0.5794   7.558 6.69e-06 ***
## predictorc   9.5088     0.5794  16.412 1.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.296 on 12 degrees of freedom
## Multiple R-squared:  0.9646, Adjusted R-squared:  0.9557 
## F-statistic:   109 on 3 and 12 DF,  p-value: 5.696e-09
extractAndPlotResults <- function(intModel){
    intCoefPlot <- broom::tidy(intModel)
    intCoefPlot$term <- factor(gsub("predictor", "", intCoefPlot$term))

    plotOut <- ggIntDemo + geom_point(data = intCoefPlot,
                           aes(x = term, y = estimate),
                           position = position_dodge(width = 0.4),
                           color = 'blue', size = 8, alpha = 0.25)
    print(plotOut)
}


# Run the code below, which extracts the model's coefficients and plots them
extractAndPlotResults(intModel)

multIntDemo <- data.frame(group=factor(c(rep("a", 5), rep("b", 5), rep("c", 5))), 
                          x=rep(0:4, times=3), 
                          intercept=c(4.11, -1.69, 1.09, 1.9, 1.21, 4.63, 10.29, 4.67, 12.06, 4.78, 15.22, 19.15, 4.44, 8.88, 9.47), 
                          response=c(4.11, 2.31, 9.09, 13.9, 17.21, 4.63, 14.29, 12.67, 24.06, 20.78, 15.22, 23.15, 12.44, 20.88, 25.47)
                          )
str(multIntDemo)
## 'data.frame':    15 obs. of  4 variables:
##  $ group    : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 2 2 2 2 2 ...
##  $ x        : int  0 1 2 3 4 0 1 2 3 4 ...
##  $ intercept: num  4.11 -1.69 1.09 1.9 1.21 ...
##  $ response : num  4.11 2.31 9.09 13.9 17.21 ...
plot_output1 <- function(out1){
    ggmultIntgDemo <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', fill = NA, color = 'orange', size = 3)
    print(ggmultIntgDemo)
}

plot_output2 <- function(out2){
    out2Tidy <- broom::tidy(out2)
    out2Tidy$term <- gsub("group", "", out2Tidy$term)
    out2Plot <- data.frame(group = pull(out2Tidy[ -1, 1]),
                           slope = pull(out2Tidy[ 1, 2]),
                           intercept = pull(out2Tidy[ -1, 2])
                           )
    ggmultIntgDemo2 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group))+
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        geom_abline(data = out2Plot,
                    aes(intercept = intercept, slope = slope, color = group))
    print(ggmultIntgDemo2)
}

plot_output3 <- function(out3){
    ggmultIntgDemo3 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', aes(color = group), fill = NA)
    print(ggmultIntgDemo3)
}

# First, run a model without considering a different intercept for each group
out1 <- lm( response ~ x, data=multIntDemo )
summary(out1)
## 
## Call:
## lm(formula = response ~ x, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.101 -4.021 -2.011  3.590 11.739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    8.141      2.615   3.113  0.00824 **
## x              3.270      1.068   3.062  0.00908 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.848 on 13 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3744 
## F-statistic: 9.378 on 1 and 13 DF,  p-value: 0.009081
plot_output1(out1)

# Considering the same slope but different intercepts for each group
out2 <- lm( response ~ x + group - 1, data=multIntDemo )
summary(out2)
## 
## Call:
## lm(formula = response ~ x + group - 1, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.992 -2.219 -0.234  1.810  6.988 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## x        3.2697     0.7516   4.350 0.001155 ** 
## groupa   2.7847     2.3767   1.172 0.266085    
## groupb   8.7467     2.3767   3.680 0.003625 ** 
## groupc  12.8927     2.3767   5.425 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.117 on 11 degrees of freedom
## Multiple R-squared:  0.9534, Adjusted R-squared:  0.9364 
## F-statistic: 56.23 on 4 and 11 DF,  p-value: 2.97e-07
plot_output2(out2)

# Considering different slopes and intercepts for each group (i.e., an interaction)
out3 <- lm( response ~ x + group - 1 + x:group, multIntDemo)
summary(out3)
## 
## Call:
## lm(formula = response ~ x + group - 1 + x:group, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.992 -2.429 -0.234  2.368  5.541 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## x           3.779      1.308   2.888 0.017941 *  
## groupa      1.766      3.205   0.551 0.595053    
## groupb      6.872      3.205   2.144 0.060621 .  
## groupc     15.786      3.205   4.925 0.000819 ***
## x:groupb    0.428      1.851   0.231 0.822263    
## x:groupc   -1.956      1.851  -1.057 0.318050    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.138 on 9 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9358 
## F-statistic: 37.42 on 6 and 9 DF,  p-value: 7.263e-06
plot_output3(out3)

multIntDemo$intercept <- c(-0.87, 3.35, 1.25, 0.88, -1.05, 4.55, 1.22, 3.34, 1.26, 3.75, 7.71, 9.59, 2.28, 1.9, 13.35)
multIntDemo$response <- c(-0.87, 6.35, 7.25, 9.88, 10.95, 4.55, 4.22, 9.34, 10.26, 15.75, 7.71, 12.59, 8.28, 10.9, 25.35)

# Run model
outLmer <- lme4::lmer( response ~ x + ( 1 | group), multIntDemo)

# Look at model outputs 
summary( outLmer )
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ x + (1 | group)
##    Data: multIntDemo
## 
## REML criterion at convergence: 76.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.31584 -0.61104 -0.01592  0.45125  2.19118 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  group    (Intercept)  7.98    2.825   
##  Residual             10.71    3.272   
## Number of obs: 15, groups:  group, 3
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   3.5540     2.1913   1.622
## x             2.9733     0.5975   4.977
## 
## Correlation of Fixed Effects:
##   (Intr)
## x -0.545
broom::tidy( outLmer )
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## # A tibble: 4 x 5
##   term                    estimate std.error statistic group   
##   <chr>                      <dbl>     <dbl>     <dbl> <chr>   
## 1 (Intercept)                 3.55     2.19       1.62 fixed   
## 2 x                           2.97     0.597      4.98 fixed   
## 3 sd_(Intercept).group        2.82    NA         NA    group   
## 4 sd_Observation.Residual     3.27    NA         NA    Residual
extractAndPlotOutput <- function(outLmer, slope=3){
    multIntDemo$lmerPredict <- predict(outLmer)
    ggmultIntgDemo2 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group))+
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        geom_abline(data = multIntDemo,
                    aes(intercept = intercept, slope = slope, color = group))
    outPlot <-  ggmultIntgDemo2 +
                geom_line( data =  multIntDemo,
                      aes(x = x, y = lmerPredict, color = group),
                      linetype = 2)
    print(outPlot)
}


# Extract predictor variables and plot
extractAndPlotOutput(outLmer)

# Random effect slopes
multIntDemo$response <- c(-0.72, 1.5, 4.81, 6.61, 13.62, 10.21, 9.64, 11.91, 16.39, 16.97, 8.76, 14.79, 15.83, 15.27, 17.36)
multIntDemo$intercept <- c(-0.72, -1.5, -1.19, -2.39, 1.62, 10.21, 6.64, 5.91, 7.39, 4.97, 8.76, 11.79, 9.83, 6.27, 5.36)

outLmer2 <- lme4::lmer( response ~ ( x|group ), multIntDemo)
## boundary (singular) fit: see ?isSingular
summary(outLmer2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ (x | group)
##    Data: multIntDemo
## 
## REML criterion at convergence: 69.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.56739 -0.54097 -0.06276  0.75132  1.27928 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  group    (Intercept) 273.766  16.546        
##           x             6.090   2.468   -1.00
##  Residual               2.467   1.571        
## Number of obs: 15, groups:  group, 3
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   21.676      1.383   15.67
## convergence code: 0
## boundary (singular) fit: see ?isSingular
broom::tidy(outLmer2)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## # A tibble: 5 x 5
##   term                    estimate std.error statistic group   
##   <chr>                      <dbl>     <dbl>     <dbl> <chr>   
## 1 (Intercept)                21.7       1.38      15.7 fixed   
## 2 sd_(Intercept).group       16.5      NA         NA   group   
## 3 sd_x.group                  2.47     NA         NA   group   
## 4 cor_(Intercept).x.group    -1        NA         NA   group   
## 5 sd_Observation.Residual     1.57     NA         NA   Residual
plotOutput <- function(outLmer2){
    multIntDemo$lmerPredict2 <- predict(outLmer2)
    ggmultIntgDemo3 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', aes(color = group), fill = NA)
    plotOut <- ggmultIntgDemo3 +
            geom_line( data =  multIntDemo,
                      aes(x = x, y = lmerPredict2, color = group),
                      linetype = 2)
    print(plotOut)
}


# Extract and plot
plotOutput(outLmer2)

# Mixed effect model
lmerModel <- lme4::lmer(mathgain ~ sex + 
                  mathprep + mathknow + (1|classid) +
                  (1|schoolid), data = studentData, na.action = "na.omit",
                  REML = TRUE)
summary(lmerModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## mathgain ~ sex + mathprep + mathknow + (1 | classid) + (1 | schoolid)
##    Data: studentData
## 
## REML criterion at convergence: 10677.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3203 -0.6146 -0.0294  0.5467  5.5331 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  classid  (Intercept)  103.57  10.177  
##  schoolid (Intercept)   85.44   9.244  
##  Residual             1019.47  31.929  
## Number of obs: 1081, groups:  classid, 285; schoolid, 105
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   52.250      3.838  13.613
## sexfemale     -1.526      2.041  -0.747
## mathprep       2.426      1.298   1.869
## mathknow       2.405      1.299   1.851
## 
## Correlation of Fixed Effects:
##           (Intr) sexfml mthprp
## sexfemale -0.268              
## mathprep  -0.878  0.001       
## mathknow  -0.003  0.011  0.005
extractAndPlot <- function(lmerModel){
    modelOutPlot <- broom::tidy(lmerModel, conf.int = TRUE)
    modelOutPlot <- modelOutPlot[ modelOutPlot$group =="fixed" &
                               modelOutPlot$term != "(Intercept)", ]
    plotOut <- ggplot(modelOutPlot, aes(x = term, y = estimate,
                             ymin = conf.low,
                             ymax = conf.high)) +
            theme_minimal() +
            geom_hline(yintercept = 0.0, color = 'red', size = 2.0) +
            geom_point() +
            geom_linerange() + coord_flip()
    print(plotOut)
}


# Extract and plot 
extractAndPlot(lmerModel)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector


Chapter 2 - Linear Mixed-Effect Models

Linear mixed effect model - Birth rates data:

  • Small populations are highly sensitive to stochastic effects - if the mean is 1, a group of 5 might have 0 or 10
  • Questions about how counties may impact birth rates, over and above other demographic factors
    • Example of plotting birth rate vs. county - will see both the highest and lowest birth rates in the smallest counties
  • Random effect syntax for the lmer model includes
    • (1 | group) - random intercept with fixed mean
    • (1 | g1/g2) - intercepts vary among g1, and among g2 within g1
    • (1 | g1) + (1 | g2) - random intercepts for two variables
    • x + (x | g) - correlated random slope and intercept
    • x + (x || g) - uncorrelated random slope and intercept
    • See lme4 documentation for additional details
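The nested shorthand in that list is the easiest to misread; it expands as follows (a sketch with hypothetical grouping variables g1 and g2 and a hypothetical myData):

# (1 | g1/g2) is shorthand for (1 | g1) + (1 | g1:g2):
# an intercept per g1 level plus an intercept per g2-within-g1 combination
# lme4::lmer(y ~ x + (1 | g1/g2),            data = myData)
# lme4::lmer(y ~ x + (1 | g1) + (1 | g1:g2), data = myData)  # equivalent
# Note: (1 | g1) + (1 | g2) instead treats g1 and g2 as crossed, not nested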

Understanding and reporting the outputs of lmer:

  • The output from lmer is similar to the output from lm, but with some key differences - if using print(), will see
    • The method used is REML - restricted maximum likelihood - which tends to behave better than maximum likelihood for these problems
    • There is an REML convergence criterion, which can be a helpful diagnostic
    • Can see the standard deviations for both the state and the residual, along with the number of observations
    • Get the fixed effects coefficients in a similar form as lm()
  • The summary() call on lmer produces several additional outputs
    • Residuals summary
    • Fixed effects estimates include SE and t-values (but not p-values)
    • Correlations of fixed effects
  • Can grab only the fixed effects using fixef(myLMERObject)
    • Can grab only the random effects using ranef(myLMERObject), though these will not have confidence intervals
    • The random effects confidence intervals could be estimated using bootstrapping or Bayesian methods per the author of lme4 - but actual random effects are just unobserved random variables rather than parameters
  • Can grab only the confidence intervals using confint(myLMERObject)
  • Need to be careful in reporting the results - figures vs. tables vs. in-line descriptions
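The extractors above can be tried on lme4's built-in sleepstudy data (a sketch; this particular model is illustrative, not from the course):

fm <- lme4::lmer(Reaction ~ Days + (Days | Subject), data = lme4::sleepstudy)
lme4::fixef(fm)    # fixed-effect coefficients only
lme4::ranef(fm)    # per-group random-effect estimates (no confidence intervals)
confint(fm)        # profile confidence intervals, including the variance terms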

Statistical inference with Maryland crime data:

  • The Maryland crime data is available on data.gov - interesting for many public and private purposes
  • The null hypothesis test can be used with lmer - frequentist approach
    • By default, lmer does not provide p-values, as there is ongoing debate as to the degrees of freedom and impact on reported results
    • Can use lmerTest package to calculate and report on the p-values
  • Can use ANOVA to look at the variability explained by one model versus another model, and the associated degrees of freedom needed
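A sketch of both ideas, again on lme4's built-in sleepstudy data (assumes the lmerTest package is installed; the models are illustrative):

# lmerTest::lmer adds Satterthwaite degrees of freedom and p-values to summary()
fullModel <- lmerTest::lmer(Reaction ~ Days + (1 | Subject),
                            data = lme4::sleepstudy)
summary(fullModel)   # fixed-effects table now includes df and Pr(>|t|)

# anova() compares nested models (models are refit with ML for the test)
nullModel <- lme4::lmer(Reaction ~ 1 + (1 | Subject), data = lme4::sleepstudy)
anova(fullModel, nullModel)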

Example code includes:

# Read in births data
rawBirths <- read.csv("./RInputFiles/countyBirthsDataUse.csv")
countyBirthsData <- rawBirths
str(countyBirthsData)
## 'data.frame':    580 obs. of  8 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year              : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ TotalPopulation   : int  203709 115620 103057 104173 660367 156993 353089 415395 226519 119565 ...
##  $ BirthRate         : num  11.5 12.1 11.8 12.4 13.3 ...
##  $ AverageBirthWeight: num  3261 3209 3239 3207 3177 ...
##  $ AverageAgeofMother: num  27.5 26.3 25.8 26.9 27.9 ...
##  $ CountyName        : Factor w/ 472 levels "Ada","Adams",..: 22 64 141 189 200 229 248 273 278 279 ...
##  $ State             : Factor w/ 50 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
# First, build an lmer model with state as a random effect. Then look at the model's summary and the plot of residuals.
birthRateStateModel <- lme4::lmer(BirthRate ~ (1|State), data=countyBirthsData)
summary(birthRateStateModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ (1 | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2411
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7957 -0.6056 -0.1063  0.5211  5.5948 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.899    1.378   
##  Residual             3.256    1.804   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  12.3362     0.2216   55.67
plot(birthRateStateModel)

# Next, plot the predicted values from the model on top of the plot shown during the video.
countyBirthsData$birthPredictState <- predict(birthRateStateModel, countyBirthsData)
ggplot() + theme_minimal() +
    geom_point(data =countyBirthsData, aes(x = TotalPopulation, y = BirthRate)) + 
    geom_point(data = countyBirthsData, aes(x = TotalPopulation, y = birthPredictState),
               color = 'blue', alpha = 0.5
               )
## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

# Include the AverageAgeofMother as a fixed effect within the lmer and state as a random effect
ageMotherModel <- lme4::lmer( BirthRate ~ AverageAgeofMother + (1|State), data=countyBirthsData)
summary(ageMotherModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ AverageAgeofMother + (1 | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.562    1.250   
##  Residual             2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.57033    1.81801  15.165
## AverageAgeofMother -0.53549    0.06349  -8.434
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
# Compare the random-effect model to the linear effect model 
summary(lm(BirthRate ~ AverageAgeofMother, data = countyBirthsData))
## 
## Call:
## lm(formula = BirthRate ~ AverageAgeofMother, data = countyBirthsData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8304 -1.3126 -0.1795  1.2198  8.7327 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        29.06637    1.83374  15.851   <2e-16 ***
## AverageAgeofMother -0.59380    0.06441  -9.219   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.065 on 576 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1286, Adjusted R-squared:  0.1271 
## F-statistic: 84.99 on 1 and 576 DF,  p-value: < 2.2e-16
# Include the AverageAgeofMother as a correlated random-effect slope parameter
ageMotherModelRandomCorrelated <- lme4::lmer(BirthRate ~ AverageAgeofMother + (AverageAgeofMother|State),
                       countyBirthsData)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0133555
## (tol = 0.002, component 1)
summary(ageMotherModelRandomCorrelated)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2337.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8402 -0.5965 -0.1132  0.5233  5.1817 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev. Corr 
##  State    (Intercept)        78.33144 8.8505        
##           AverageAgeofMother  0.08433 0.2904   -0.99
##  Residual                     2.80345 1.6744        
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.21961    2.41010  11.294
## AverageAgeofMother -0.52344    0.08293  -6.312
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.997
## convergence code: 0
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)
# Include the AverageAgeofMother as an uncorrelated random-effect slope parameter
ageMotherModelRandomUncorrelated <- lme4::lmer(BirthRate ~ AverageAgeofMother + 
                                                    (AverageAgeofMother || State), data=countyBirthsData
                                               )
## boundary (singular) fit: see ?isSingular
summary(ageMotherModelRandomUncorrelated)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## BirthRate ~ AverageAgeofMother + ((1 | State) + (0 + AverageAgeofMother |  
##     State))
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev.
##  State    (Intercept)        1.562    1.250   
##  State.1  AverageAgeofMother 0.000    0.000   
##  Residual                    2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.57031    1.81801  15.165
## AverageAgeofMother -0.53549    0.06349  -8.434
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
## convergence code: 0
## boundary (singular) fit: see ?isSingular
out <- ageMotherModelRandomUncorrelated

# Extract the fixed-effect coefficients
lme4::fixef(out)
##        (Intercept) AverageAgeofMother 
##         27.5703059         -0.5354876
# Extract the random-effect coefficients
lme4::ranef(out)
## $State
##    (Intercept) AverageAgeofMother
## AK  1.03554361                  0
## AL -0.52501630                  0
## AR  0.48024018                  0
## AZ -1.04095779                  0
## CA  0.50530282                  0
## CO  0.09585291                  0
## CT -1.91641428                  0
## DC  0.96034296                  0
## DE -0.38939548                  0
## FL -1.87441671                  0
## GA  0.39776296                  0
## HI  0.08513460                  0
## IA  0.96281025                  0
## ID  1.17380179                  0
## IL -0.12739802                  0
## IN -0.32655768                  0
## KS  0.85651904                  0
## KY  0.64872241                  0
## LA  1.04438181                  0
## MA -1.40084157                  0
## MD  0.10842594                  0
## ME -1.63524235                  0
## MI -1.13798940                  0
## MN  0.93266949                  0
## MO  0.07081678                  0
## MS -0.21398117                  0
## MT -0.13190987                  0
## NC -0.28681725                  0
## ND  0.99852760                  0
## NE  1.49394698                  0
## NH -1.45444986                  0
## NJ -0.30090199                  0
## NM -0.69755039                  0
## NV  0.09013066                  0
## NY -0.58164079                  0
## OH -1.07391197                  0
## OK  0.77998608                  0
## OR -0.75846975                  0
## PA -1.59333857                  0
## RI -1.36399831                  0
## SC -0.59295913                  0
## SD  1.35146914                  0
## TN -0.13513429                  0
## TX  1.70872778                  0
## UT  3.66063407                  0
## VA  1.59188509                  0
## VT -0.51108138                  0
## WA  0.23008247                  0
## WI -0.51647561                  0
## WV -0.67686749                  0
## 
## with conditional variances for "State"
# Estimate the confidence intervals 
(ciOut <- confint(out))
## Computing profile confidence intervals ...
##                         2.5 %      97.5 %
## .sig01              0.0000000  1.61214393
## .sig02              0.0000000  0.05033958
## .sigma              1.6093152  1.81561031
## (Intercept)        24.0121844 31.14669045
## AverageAgeofMother -0.6605319 -0.41123093
# Technical note: extracting regression coefficients from lmer objects is tricky (see the discussion between the lme4 and broom package authors)
# Extract the parameter estimates and confidence intervals and reshape the data
dataPlot <- data.frame(cbind( lme4::fixef(out), ciOut[ 4:5, ]))
rownames(dataPlot)[1] <- "Intercept"
colnames(dataPlot) <- c("mean", "l95", "u95")
dataPlot$parameter <- rownames(dataPlot)

# Print the new dataframe
print(dataPlot)
##                          mean        l95        u95          parameter
## Intercept          27.5703059 24.0121844 31.1466905          Intercept
## AverageAgeofMother -0.5354876 -0.6605319 -0.4112309 AverageAgeofMother
# Plot the results using ggplot2
ggplot(dataPlot, aes(x = parameter, y = mean,
                     ymin = l95, ymax = u95)) +
    geom_hline( yintercept = 0, color = 'red' ) +
    geom_linerange() + geom_point() + coord_flip() + theme_minimal()

# Read in crime data
rawCrime <- read.csv("./RInputFiles/MDCrime.csv")
MDCrime <- rawCrime
str(MDCrime)
## 'data.frame':    192 obs. of  5 variables:
##  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ County: Factor w/ 24 levels "ALLEGANY","ANNE ARUNDEL",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Year  : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ Crime : int  3167 10871 5713 257 149 374 490 729 181 752 ...
##  $ Year2 : int  0 0 0 0 0 0 0 0 0 0 ...
plot1 <- ggplot(data = MDCrime, aes(x = Year, y = Crime, group = County)) +
    geom_line() + theme_minimal() +
    ylab("Major crimes reported per county")
print(plot1)

plot1 + geom_smooth(method = 'lm')

# Null hypothesis testing uses p-values to see if a variable is "significant"
# Recently, the abuse and overuse of null hypothesis testing and p-values led the American Statistical Association to issue a statement about the use of p-values
# Because of these criticisms and other numerical challenges, Doug Bates (the creator of the lme4 package) does not include p-values as part of his package
# However, you may still want to estimate p-values, because p-values are still commonly used; several packages exist, including the lmerTest package
# https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf

# Load lmerTest
# library(lmerTest)

# Fit the model with Year as both a fixed and random-effect
lme4::lmer(Crime ~ Year + (1 + Year | County) , data = MDCrime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge: degenerate Hessian with 1
## negative eigenvalues
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year + (1 + Year | County)
##    Data: MDCrime
## REML criterion at convergence: 2892.018
## Random effects:
##  Groups   Name        Std.Dev. Corr 
##  County   (Intercept) 386.29        
##           Year          1.34   -0.84
##  Residual             328.32        
## Number of obs: 192, groups:  County, 24
## Fixed Effects:
## (Intercept)         Year  
##   136642.97       -67.33  
## convergence code 0; 2 optimizer warnings; 0 lme4 warnings
# Fit the model with Year2 rather than Year
out <- lme4::lmer(Crime ~ Year2 + (1 + Year2 | County) , data = MDCrime)

# Examine the model's output
summary(out)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
##    Data: MDCrime
## 
## REML criterion at convergence: 2535.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8080 -0.2235 -0.0390  0.2837  3.0767 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  County   (Intercept) 7584514  2754.00       
##           Year2         16940   130.15  -0.91
##  Residual                8425    91.79       
## Number of obs: 192, groups:  County, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1577.28     562.29   2.805
## Year2         -67.33      26.72  -2.519
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.906
# Build the null model with only County as a random-effect
null_model <- lme4::lmer(Crime ~ (1 | County) , data = MDCrime)

# Build the Year2 model with Year2 as a fixed and random slope and County as the random-effect
year_model <- lme4::lmer(Crime ~ Year2 + (1 + Year2 | County) , data = MDCrime)

# Compare the two models using an anova
anova(null_model, year_model)
## refitting model(s) with ML (instead of REML)
## Data: MDCrime
## Models:
## null_model: Crime ~ (1 | County)
## year_model: Crime ~ Year2 + (1 + Year2 | County)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## null_model  3 2954.4 2964.2 -1474.2   2948.4                             
## year_model  6 2568.9 2588.4 -1278.4   2556.9 391.52      3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chapter 3 - Generalized Linear Mixed-Effects Models

Crash course on GLMs - relaxing the assumption of normally distributed residuals:

  • Non-normal data can be transformed (e.g., with an arcsine transformation)
  • However, with advances in methodology, it is possible to model the data more directly using binomial and Poisson distributions
  • The basic call is glm(y ~ x, family = "") # the default is family = "gaussian", which is the same as lm()
  • The Poisson distribution is frequently best for count data, such as website visitors per hour - it assumes the mean equals the variance (generally best for small counts, less than about 30; a normal approximation can be used for large counts)
  • For logistic regression, data can be entered in any of three formats
    • Binary (y = 0 or 1) - glm(y ~ x, family = "binomial")
    • Wilkinson-Rogers - glm(cbind(success, failure) ~ x, family = "binomial")
    • Weighted - glm(y ~ x, weights = weights, family = "binomial")
    • These methods differ primarily in the degrees of freedom (and thus the deviance)
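
The mean-equals-variance property of the Poisson distribution can be checked with a quick simulation (a minimal sketch; the sample size and lambda value are arbitrary choices, not from the course):

```r
# Simulate Poisson counts: for a Poisson(lambda), the sample mean and
# sample variance should both be close to lambda
set.seed(42)
simCounts <- rpois(n = 10000, lambda = 5)
mean(simCounts)  # close to 5
var(simCounts)   # also close to 5
```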

Binomial data - modeling data with only two outcomes:

  • The traditional method of analysis looks at the proportion of successes
  • A GLM allows the data to be modeled directly - logistic regression (logit link)
  • Binomial data can be fit using glmer(y ~ x + (1|group), family = "binomial")
  • The regression coefficients (log odds) can be difficult to explain, sometimes leading to the use of odds ratios instead
    • An odds ratio of 2.0 would mean 2:1 odds of success for that specific group
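
Because logistic-regression coefficients are on the log-odds scale, exponentiating them yields odds ratios. A minimal sketch with simulated data (the variable names and effect size here are invented for illustration):

```r
# Simulate binary outcomes where exposure roughly doubles the odds of success
set.seed(1)
exposure <- rbinom(200, size = 1, prob = 0.5)
success  <- rbinom(200, size = 1, prob = plogis(-0.5 + log(2) * exposure))
fit <- glm(success ~ exposure, family = "binomial")

# Exponentiate the log-odds coefficient to get the odds ratio,
# and the profile confidence interval to get a CI on the same scale
exp(coef(fit)["exposure"])
exp(confint(fit)["exposure", ])
```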

Count data:

  • Examples include the number of events per hour (website hits) or counts per area (birds)
  • Count data differ from binomial data in that there is no pre-specified upper bound
  • While the Chi-squared test is often used for goodness of fit or tests of association with count data, the Poisson GLM can be a nice alternative
    • glm(y ~ x, family = "poisson")
    • glmer(y ~ x + (1|group), family = "poisson")
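
As a sketch of the Chi-squared vs. Poisson GLM comparison (the counts below are invented toy data, not from the course):

```r
# Toy count data: events observed in replicate plots for two groups
eventCounts <- c(30, 18, 25, 12, 28, 15)
plotGroup   <- factor(rep(c("a", "b"), times = 3))

# Chi-squared goodness-of-fit test on the group totals (equal expected counts)
chisq.test(tapply(eventCounts, plotGroup, sum))

# The Poisson GLM gives an estimate of the group effect (on the log scale),
# not just a test statistic
summary(glm(eventCounts ~ plotGroup, family = "poisson"))
```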

Example code includes:

# In this case study, we will be working with simulated dose-response data
# The response is mortality (1) or survival (0) at the end of a study. During this exercise, we will fit a logistic regression using all three methods described in the video
# You have been given two datasets. dfLong has the data in a "long" format with each row corresponding to an observation (i.e., a 0 or 1)
# dfShort has the data in an aggregated format with each row corresponding to a treatment (e.g., 6 successes, 4 failures, number of replicates = 10, proportion = 0.6)

dfLong <- data.frame(dose=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), 
                     mortality=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1)
                     )
str(dfLong)
## 'data.frame':    120 obs. of  2 variables:
##  $ dose     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mortality: num  0 0 0 0 0 0 0 0 0 0 ...
dfShort <- dfLong %>% 
    group_by(dose) %>%
    summarize(mortality=sum(mortality), nReps=n()) %>%
    mutate(survival=nReps-mortality, mortalityP=mortality/nReps)
dfShort
## # A tibble: 6 x 5
##    dose mortality nReps survival mortalityP
##   <dbl>     <dbl> <int>    <dbl>      <dbl>
## 1  0         0       20    20.0       0    
## 2  2.00      4.00    20    16.0       0.200
## 3  4.00      8.00    20    12.0       0.400
## 4  6.00     10.0     20    10.0       0.500
## 5  8.00     11.0     20     9.00      0.550
## 6 10.0      13.0     20     7.00      0.650
# Fit a glm using data in a long format
fitLong <- glm(mortality ~ dose, data = dfLong, family = "binomial")
summary(fitLong)
## 
## Call:
## glm(formula = mortality ~ dose, family = "binomial", data = dfLong)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5916  -0.8245  -0.4737   1.0440   1.8524  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44532  -4.785 1.71e-06 ***
## dose         0.30663    0.06821   4.495 6.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 159.76  on 119  degrees of freedom
## Residual deviance: 134.71  on 118  degrees of freedom
## AIC: 138.71
## 
## Number of Fisher Scoring iterations: 3
# Fit a glm using data in a short format with two columns
fitShort <- glm( cbind(mortality , survival ) ~ dose , data = dfShort, family = "binomial")
summary(fitShort)
## 
## Call:
## glm(formula = cbind(mortality, survival) ~ dose, family = "binomial", 
##     data = dfShort)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.1186   0.2316   1.0698   0.6495  -0.2699  -0.6634  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44537  -4.784 1.72e-06 ***
## dose         0.30663    0.06822   4.495 6.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31.6755  on 5  degrees of freedom
## Residual deviance:  6.6214  on 4  degrees of freedom
## AIC: 27.415
## 
## Number of Fisher Scoring iterations: 4
# Fit a glm using data in a short format with weights
fitShortP <- glm( mortalityP ~ dose , data = dfShort, weights = nReps , family = "binomial")
summary(fitShortP)
## 
## Call:
## glm(formula = mortalityP ~ dose, family = "binomial", data = dfShort, 
##     weights = nReps)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.1186   0.2316   1.0698   0.6495  -0.2699  -0.6634  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44537  -4.784 1.72e-06 ***
## dose         0.30663    0.06822   4.495 6.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31.6755  on 5  degrees of freedom
## Residual deviance:  6.6214  on 4  degrees of freedom
## AIC: 27.415
## 
## Number of Fisher Scoring iterations: 4
y <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 2, 0, 1, 1, 0, 1, 5, 1, 1)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)


# Fit the linear model
summary(lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3677 -0.6145 -0.2602  0.4297  3.4805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.15263    0.50312   0.303   0.7651  
## x            0.07594    0.04200   1.808   0.0873 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 18 degrees of freedom
## Multiple R-squared:  0.1537, Adjusted R-squared:  0.1067 
## F-statistic: 3.269 on 1 and 18 DF,  p-value: 0.08733
# Fit the generalized linear model
summary(glm(y ~ x, family = "poisson"))
## 
## Call:
## glm(formula = y ~ x, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6389  -0.9726  -0.3115   0.5307   2.1559  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.04267    0.60513  -1.723   0.0849 .
## x            0.08360    0.04256   1.964   0.0495 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23.589  on 19  degrees of freedom
## Residual deviance: 19.462  on 18  degrees of freedom
## AIC: 52.17
## 
## Number of Fisher Scoring iterations: 5
# Often, we want to "look" at our data and trends in our data
# ggplot2 allows us to add trend lines to our data
# The default lines are created using a technique called local regression (loess)
# However, we can specify different models, including GLMs
# During this exercise, we'll see how to plot a GLM

# Plot the data using jittered points and the default stat_smooth
ggplot(data = dfLong, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(fill = 'pink', color = 'red') 
## `geom_smooth()` using method = 'loess'

# Plot the data using jittered points and the glm stat_smooth
ggplot(data = dfLong, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(method = 'glm',  method.args = list(family = "binomial"))

# library(lmerTest)

df <- data.frame(dose=rep(rep(c(0, 2, 4, 6, 8, 10), each=20), times=3), 
                 mortality=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1), 
                 replicate=factor(rep(letters[1:3], each=120))
                 )
str(df)
## 'data.frame':    360 obs. of  3 variables:
##  $ dose     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mortality: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ replicate: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
glmerOut <- lme4::glmer(mortality ~ dose + (1|replicate), family = 'binomial', data = df)
summary(glmerOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: mortality ~ dose + (1 | replicate)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##    378.1    389.8   -186.0    372.1      357 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3484 -0.6875 -0.3031  0.6413  2.1907 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev.
##  replicate (Intercept) 6.658e-15 8.16e-08
## Number of obs: 360, groups:  replicate, 3
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.38736    0.27334  -8.734   <2e-16 ***
## dose         0.40948    0.04414   9.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## dose -0.884
# library(lmerTest)
# Fit the model and look at its summary 
# modelOut <- lme4::glmer( cbind(Purchases, Pass) ~ friend + ranking + (1|city), data = allData, family = 'binomial')
# summary( modelOut) 

# Compare outputs to a lmer model
# summary(lme4::lmer( Purchases/( Purchases + Pass) ~ friend + ranking + (1|city), data = allData))


# Run the code to see how to calculate odds ratios
# summary(modelOut) 
# exp(fixef(modelOut)[2])
# exp(confint(modelOut)[3, ])


# Load lmerTest
# library(lmerTest)


userGroups <- data.frame(group=factor(rep(rep(LETTERS[1:4], each=10), times=2)), 
                         webpage=factor(rep(c("old", "new"), each=40)), 
                         clicks=c(0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 2, 0, 1, 1, 0, 3, 2, 3, 1, 2, 4, 2, 1, 0, 2, 0, 1, 2, 0, 2, 1, 1, 2, 4, 2, 8, 1, 1, 1, 2, 1, 1, 0, 0, 3, 0, 1, 4, 1, 2, 0, 1, 1, 0, 0, 3, 2, 0, 3, 1, 2, 2, 0, 2, 3, 1, 3, 2, 4, 4, 2, 1, 5, 2)
                         )
str(userGroups)
## 'data.frame':    80 obs. of  3 variables:
##  $ group  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
##  $ webpage: Factor w/ 2 levels "new","old": 2 2 2 2 2 2 2 2 2 2 ...
##  $ clicks : num  0 0 0 0 0 0 2 0 0 0 ...
# Fit a Poisson glmer
summary( lme4::glmer(clicks ~ webpage + (1|group), family = 'poisson', data = userGroups))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: clicks ~ webpage + (1 | group)
##    Data: userGroups
## 
##      AIC      BIC   logLik deviance df.resid 
##    255.5    262.6   -124.7    249.5       77 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.3999 -0.9104 -0.2340  0.4978  5.6126 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0.07093  0.2663  
## Number of obs: 80, groups:  group, 4
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.5524     0.1797   3.074  0.00211 **
## webpageold   -0.5155     0.1920  -2.685  0.00726 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## webpageold -0.400
# library(lmerTest)


rawIL <- read.csv("./RInputFiles/ILData.csv")
ILdata <- rawIL
str(ILdata)
## 'data.frame':    1808 obs. of  4 variables:
##  $ age   : Factor w/ 4 levels "15_19","20_24",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ county: Factor w/ 47 levels "ALEXANDER","BROWN",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ count : int  0 0 0 5 0 7 0 4 0 12 ...
# Age goes before year
modelOut <- lme4::glmer(count ~ age + year + (year|county), family = 'poisson', data = ILdata)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 0.00144074 (tol =
## 0.001, component 1)
summary(modelOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: count ~ age + year + (year | county)
##    Data: ILdata
## 
##      AIC      BIC   logLik deviance df.resid 
##   3215.6   3259.6  -1599.8   3199.6     1800 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4511 -0.0151 -0.0056 -0.0022  4.0053 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr 
##  county (Intercept) 129.9459 11.3994       
##         year          0.0648  0.2546  -1.00
## Number of obs: 1808, groups:  county, 47
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.76258    2.13022  -5.052 4.36e-07 ***
## age20_24     -0.04152    0.03690  -1.125    0.261    
## age25_29     -1.16262    0.05290 -21.976  < 2e-16 ***
## age30_34     -2.28278    0.08487 -26.898  < 2e-16 ***
## year          0.32708    0.25422   1.287    0.198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) a20_24 a25_29 a30_34
## age20_24 -0.008                     
## age25_29 -0.006  0.341              
## age30_34 -0.004  0.213  0.148       
## year     -0.764  0.000  0.000  0.000
## convergence code: 0
## Model failed to converge with max|grad| = 0.00144074 (tol = 0.001, component 1)
# Extract out fixed effects
lme4::fixef(modelOut)
##  (Intercept)     age20_24     age25_29     age30_34         year 
## -10.76258497  -0.04151848  -1.16262225  -2.28277972   0.32708039
# Extract out random effects 
lme4::ranef(modelOut)
## $county
##            (Intercept)         year
## ALEXANDER   -0.2847724  0.006331741
## BROWN       -0.2847724  0.006331741
## CALHOUN     -0.2847724  0.006331741
## CARROLL     12.2418514 -0.260423999
## CASS        -0.2847724  0.006331741
## CLARK       12.2137668 -0.268553354
## CLAY        -0.2847724  0.006331741
## CRAWFORD    12.5037407 -0.265752695
## CUMBERLAND  -0.2847724  0.006331741
## DE WITT     12.7456078 -0.277675211
## DOUGLAS     13.0751590 -0.306329903
## EDGAR       12.3642794 -0.283606045
## EDWARDS     -0.2847724  0.006331741
## FAYETTE     12.8094530 -0.273060474
## FORD        -0.2847724  0.006331741
## GALLATIN    -0.2847724  0.006331741
## GREENE      -0.2847724  0.006331741
## HAMILTON    -0.2847724  0.006331741
## HANCOCK     12.8581265 -0.305650287
## HARDIN      -0.2847724  0.006331741
## HENDERSON   -0.2847724  0.006331741
## IROQUOIS    13.1616741 -0.311372907
## JASPER      -0.2847724  0.006331741
## JERSEY      12.9202747 -0.272284048
## JO DAVIESS  12.7409389 -0.289747791
## JOHNSON     -0.2847724  0.006331741
## LAWRENCE    12.3713561 -0.268571236
## MARSHALL    -0.2847724  0.006331741
## MASON       -0.2847724  0.006331741
## MENARD      -0.2180916  0.004849989
## MERCER      12.7534193 -0.271678572
## MOULTRIE    -0.2180916  0.004849989
## PIATT       12.5653132 -0.296687752
## PIKE        12.5310614 -0.259211299
## POPE        -0.2180916  0.004849989
## PULASKI     -0.2180916  0.004849989
## PUTNAM      -0.2180916  0.004849989
## RICHLAND    12.0350865 -0.273928951
## SCHUYLER    -0.2180916  0.004849989
## SCOTT       -0.2180916  0.004849989
## SHELBY      12.5183293 -0.283472292
## STARK       -0.2180916  0.004849989
## UNION       13.1465272 -0.308673332
## WABASH      -0.2180916  0.004849989
## WASHINGTON  -0.2180916  0.004849989
## WAYNE       12.1148896 -0.253234752
## WHITE       -0.2180916  0.004849989
# Run code to see one method for plotting the data
ggplot(data = ILdata, aes(x = year, y = count, group = county)) +
    geom_line() +
    facet_grid(age ~ . ) +
    stat_smooth( method = 'glm',
                method.args = list( family = "poisson"), se = FALSE,
                alpha = 0.5) +
    theme_minimal()


Chapter 4 - Repeated Measures

An introduction to repeated measures:

  • Sampling the same subject over time is a repeated measure, a common application of mixed-effects models
    • Following the same individuals through time (a cohort) allows for controlling for individual variation
    • The paired t-test is often used for assessing a repeated-measures dataset - t.test(x1, x2, paired = TRUE) # x1 and x2 must be the same length, and each element must correspond to the same individual
  • Repeated-measures ANOVA is a conceptual extension of the paired t-test - are the means constant over time?
    • anova(lmer(y ~ time + (1|individual)))
    • Can also be used with glmer()
    • Note that the denominator degrees of freedom are still an open question - different packages calculate them differently

Sleep study:

  • Applying lmer() to the sleep study dataset - the impact of drugs on sleep patterns for 10 patients followed over time
    • This is the classic “Student” dataset, published under a pseudonym because Guinness did not allow its employees to publish at the time
    • H0 will be that the amount of sleep does not vary with the treatments
    • Modeling will be done using a linear mixed model
  • The modeling approach is iterative:
    • EDA
    • Simple regression
    • Model of interest
    • Extract information from the model
    • Visualize the final data

Hate in NY state?

  • Change in the rate of hate crimes over time by county - available from data.gov for 2010-2016
  • The level of technical detail in reporting should vary with the audience - blend the data into a story for wider audiences, while staying reproducible/technical for a scientific audience

Wrap up:

  • Hierarchical data, mixed-effects models, case studies
  • Start with the lme4 documentation for additional explorations and details

Example code includes:

y <- c(0.23, 2.735, -0.038, 6.327, -0.643, 1.69, -1.378, -1.228, -0.252, 2.014, -0.073, 6.101, 0.213, 3.127, -0.29, 8.395, -0.33, 2.735, 0.223, 1.301)
treat <- rep(c("before", "after"), times=10)
x <- rep(letters[1:10], each=2)

# Run a standard, non-paired t-test
t.test(y[treat == "before"], y[treat == "after"], paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -3.9043, df = 9.5409, p-value = 0.003215
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.594744 -1.512256
## sample estimates:
## mean of x mean of y 
##   -0.2338    3.3197
# Run a standard, paired t-test
t.test(y[treat == "before"], y[treat == "after"], paired = TRUE)
## 
##  Paired t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -4.2235, df = 9, p-value = 0.002228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.456791 -1.650209
## sample estimates:
## mean of the differences 
##                 -3.5535
library(lmerTest)
## Loading required package: lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
library(lme4)

# Run the paired-test like before
t.test(y[treat == "before"], y[treat == "after"], paired = TRUE)
## 
##  Paired t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -4.2235, df = 9, p-value = 0.002228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.456791 -1.650209
## sample estimates:
## mean of the differences 
##                 -3.5535
# Run a repeated-measures ANOVA
anova(lmer( y ~ treat + (1|x)))
## Type III Analysis of Variance Table with Satterthwaite's method
##       Sum Sq Mean Sq NumDF  DenDF F value   Pr(>F)   
## treat 63.137  63.137     1 8.9999  17.838 0.002228 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data(sleepstudy, package="lme4")
str(sleepstudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
# Plot the data
ggplot(data = sleepstudy) +
    geom_line(aes(x = Days, y = Reaction, group = Subject)) +
    stat_smooth(aes(x = Days, y = Reaction), method = 'lm', size = 3, se = FALSE)

# Build a lm 
lm( Reaction ~ Days, data = sleepstudy)
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Coefficients:
## (Intercept)         Days  
##      251.41        10.47
# Build a lmer
(lmerOut <- lmer( Reaction ~ Days + (1|Subject), data = sleepstudy))
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## REML criterion at convergence: 1786.465
## Random effects:
##  Groups   Name        Std.Dev.
##  Subject  (Intercept) 37.12   
##  Residual             30.99   
## Number of obs: 180, groups:  Subject, 18
## Fixed Effects:
## (Intercept)         Days  
##      251.41        10.47
# The lmer model you built during the previous exercise has been saved as lmerOut
# During this exercise, you will examine the effects of drug type using both an ANOVA framework and a regression framework

# Run an anova
anova(lmerOut)
## Type III Analysis of Variance Table with Satterthwaite's method
##      Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Days 162703  162703     1   161   169.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the regression coefficients
summary(lmerOut)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 251.4051     9.7467  22.8102   25.79   <2e-16 ***
## Days         10.4673     0.8042 161.0000   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371
# Read in NY hate data
rawHate <- read.csv("./RInputFiles/hateNY.csv")
hate <- rawHate
str(hate)
## 'data.frame':    233 obs. of  4 variables:
##  $ Year          : int  2010 2011 2012 2013 2014 2015 2016 2013 2010 2011 ...
##  $ County        : Factor w/ 59 levels "Albany","Allegany",..: 1 1 1 1 1 1 1 2 3 3 ...
##  $ TotalIncidents: int  13 7 5 3 3 3 3 1 22 11 ...
##  $ Year2         : int  0 1 2 3 4 5 6 3 0 1 ...
ggplot( data = hate, aes(x = Year, y = TotalIncidents, group = County)) +
    geom_line() + 
    geom_smooth(method = 'lm', se = FALSE)

# During this exercise, you will build a glmer
# Because most of the incidents are small count values, use a Poisson (R function family poisson) error term
# First, build a model using the actual year (variable Year, such as 2010, 2011, etc) - this model will fail
# Second, build a model using the rescaled year (variable Year2, such as 0, 1, 2, etc)
# This demonstrates the importance of considering where the intercept is located when building regression models
# Recall that a variable x can be both a fixed and random effect in a lmer() or glmer(): for example lmer(y ~ x + (x|group)) demonstrates this syntax

# glmer with raw Year
glmer(TotalIncidents ~ Year + (Year|County), data = hate, family = "poisson")
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.370207
## (tol = 0.001, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year + (Year | County)
##    Data: hate
##       AIC       BIC    logLik  deviance  df.resid 
## 1165.2746 1182.5298 -577.6373 1155.2746       228 
## Random effects:
##  Groups Name        Std.Dev. Corr 
##  County (Intercept) 217.8915      
##         Year          0.1084 -1.00
## Number of obs: 233, groups:  County, 59
## Fixed Effects:
## (Intercept)         Year  
##    295.4814      -0.1464  
## convergence code 0; 3 optimizer warnings; 0 lme4 warnings
# glmer with scaled Year, Year2
glmerOut <- glmer(TotalIncidents ~ Year2 + (Year2|County), data = hate, family = "poisson")
summary(glmerOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year2 + (Year2 | County)
##    Data: hate
## 
##      AIC      BIC   logLik deviance df.resid 
##   1165.3   1182.5   -577.6   1155.3      228 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5434 -0.4864 -0.1562  0.3319  3.1939 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr
##  County (Intercept) 1.16291  1.0784       
##         Year2       0.01175  0.1084   0.02
## Number of obs: 233, groups:  County, 59
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.27952    0.16600   7.708 1.28e-14 ***
## Year2       -0.14622    0.03324  -4.398 1.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.338
# Extract and manipulate data
countyTrend <- ranef(glmerOut)$County
countyTrend$county <- factor(row.names(countyTrend), levels =row.names(countyTrend)[order(countyTrend$Year2)])

# Plot results 
ggplot(data = countyTrend, aes(x = county, y = Year2)) + geom_point() +
    coord_flip() + 
    ylab("Change in hate crimes per year")  +
    xlab("County")


Forecasting Product Demand in R

Chapter 1 - Forecasting Demand with Time Series

Loading data in to an xts object:

  • The xts object will be the building block for the course - extensible time series (xts) is an extension of the zoo package - basically, a time index attached to the data matrix
  • Can create dates using dates <- seq(as.Date("YYYY-MM-DD"), length=, by="weeks") # to create weekly data
    • xts(myData, order.by=dates) # will create an xts object using dates as the index

ARIMA Time Series 101:

  • AR - AutoRegressive (lags help to determine today’s values - “long memory models”)
  • MA - Moving Average (shocks/errors help to determine today’s shocks/errors - “short memory models” due to dissipation)
  • I - Integrated (does the data have a dependency across time, and how long does it last) - make the time series stationary
    • Stationarity is the idea that effects dissipate over time - today has more impact on tomorrow than on time periods further in the future
    • Differencing (monthly, seasonal, etc.) the data can be a useful approach for making non-stationary data stationary
  • Begin by splitting the data into a training dataset and a validation dataset
  • The auto.arima() function tries to estimate the best ARIMA for a given data series
    • ARIMA(p, d, q) is ARIMA(AR, Differencing, MA)
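
As a minimal sketch of these pieces in base R (simulated data rather than course data, and stats::arima() rather than forecast::auto.arima()), differencing handles the "I" and the AR order is the p in ARIMA(p, d, q):

```r
# Simulate an AR(1) series with drift, so the raw series is non-stationary
set.seed(42)
y <- cumsum(arima.sim(model = list(ar = 0.5), n = 120) + 0.3)

# First difference removes the trend - this is the "I" (d = 1) in ARIMA
dy <- diff(y)

# ARIMA(1, 1, 0): one AR lag, one difference, no MA terms
arima_fit <- arima(y, order = c(1, 1, 0))
arima_fit$coef
```

auto.arima() automates exactly this choice of (p, d, q) by searching over candidate orders.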

Forecasting and Evaluating:

  • Can use the ARIMA data to forecast the data forward - extrapolating the signal (forecasting) and estimating the amount of noise (error or CI)
  • The forecast() function in R simplifies the process - forecast(myModel, h=) which will forecast forward h time periods
  • Two common error measurements include MAE (mean absolute error) and MAPE (mean absolute percentage error)
    • MAPE is better at putting things on a common scale
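
Both error measures are one-liners; a toy sketch with made-up actuals and forecasts (any values would do):

```r
# Hypothetical actuals and forecasts over four periods
actual <- c(100, 120, 90, 110)
fc     <- c( 95, 130, 85, 100)

# MAE: average size of the miss, in the units of the data
MAE <- mean(abs(fc - actual))   # 7.5

# MAPE: average miss as a percentage of the actuals - scale-free
MAPE <- 100 * mean(abs(fc - actual) / actual)
```

Because MAPE is scale-free, it lets you compare forecast quality across products with very different sales volumes.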

Example code includes:

# Read in beverages data
rawBev <- read.csv("./RInputFiles/Bev.csv")
bev <- rawBev
str(bev)
## 'data.frame':    176 obs. of  14 variables:
##  $ M.hi.p  : num  59.2 56.3 56.3 49.3 61.3 ...
##  $ M.lo.p  : num  29.2 26.3 26.2 26.1 25.9 ...
##  $ MET.hi.p: num  63.7 60.3 60.8 55.1 65.1 ...
##  $ MET.lo.p: num  26 25.5 25.7 26.5 25.7 ...
##  $ MET.sp.p: num  50.1 48.8 48.6 47.7 50.8 ...
##  $ SEC.hi.p: num  58.6 54.6 57.9 49.7 63.7 ...
##  $ SEC.lo.p: num  29.2 26.3 26.2 26.1 25.9 ...
##  $ M.hi    : int  458 477 539 687 389 399 392 417 568 583 ...
##  $ M.lo    : int  1455 1756 2296 3240 2252 1901 1939 1999 1798 1558 ...
##  $ MET.hi  : int  2037 1700 1747 2371 1741 2072 2353 2909 3204 2395 ...
##  $ MET.lo  : int  3437 3436 3304 3864 3406 3418 3553 3376 3233 3262 ...
##  $ MET.sp  : int  468 464 490 657 439 453 423 408 501 481 ...
##  $ SEC.hi  : int  156 151 178 217 141 149 134 148 195 170 ...
##  $ SEC.lo  : int  544 624 611 646 624 610 623 599 551 539 ...
# Load xts package 
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(forecast)


# Create the dates object as an index for your xts object
dates <- seq(as.Date("2014-01-19"), length = 176, by = "weeks")

# Create an xts object called bev_xts
bev_xts <- xts(bev, order.by = dates)


# Create the individual region sales as their own objects
MET_hi <- bev_xts[,"MET.hi"]
MET_lo <- bev_xts[,"MET.lo"]
MET_sp <- bev_xts[,"MET.sp"]

# Sum the region sales together
MET_t <- MET_hi + MET_lo + MET_sp

# Plot the metropolitan region total sales
plot(MET_t)

# Split the data into training and validation
MET_t_train <- MET_t[index(MET_t) < "2017-01-01"]
MET_t_valid <- MET_t[index(MET_t) >= "2017-01-01"]

# Use auto.arima() function for metropolitan sales
MET_t_model <- auto.arima(MET_t_train)


# Forecast the first 22 weeks of 2017
forecast_MET_t <- forecast(MET_t_model, h = 22)

# Plot this forecast #
plot(forecast_MET_t)

# Convert to numeric for ease
for_MET_t <- as.numeric(forecast_MET_t$mean)
v_MET_t <- as.numeric(MET_t_valid)

# Calculate the MAE
MAE <- mean(abs(for_MET_t - v_MET_t))

# Calculate the MAPE
MAPE <- 100*mean(abs(for_MET_t - v_MET_t)/v_MET_t)

# Print to see how good your forecast is!
print(MAE)
## [1] 898.8403
print(MAPE)
## [1] 17.10332
# Convert your forecast to an xts object
for_dates <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_t_xts <- xts(forecast_MET_t$mean, order.by = for_dates)

# Plot the validation data set
plot(for_MET_t_xts, main = 'Forecast Comparison', ylim = c(4000, 8500))

# Overlay the forecast of 2017
lines(MET_t_valid, col = "blue")

# Plot the validation data set
plot(MET_t_valid, main = 'Forecast Comparison', ylim = c(4000, 8500))

# Overlay the forecast of 2017
lines(for_MET_t_xts, col = "blue")

# Convert the limits to xts objects
lower <- xts(forecast_MET_t$lower[, 2], order.by = for_dates)
upper <- xts(forecast_MET_t$upper[, 2], order.by = for_dates)

# Adding confidence intervals of forecast to plot
lines(lower, col = "blue", lty = "dashed")

lines(upper, col = "blue", lty = "dashed")


Chapter 2 - Components of Demand

Price elasticity:

  • Price is one of the obvious factors that impacts demand, with the relationship called price elasticity (% change in demand / % change in price)
    • Elastic goods have elasticity > 1, meaning demand changes more quickly (percentage wise) than price
    • Inelastic goods have elasticity < 1, for example gasoline
    • Unit elastic goods have elasticity = 1, meaning that X% increase in price drives X% decrease in demand
    • Linear regression can be employed to estimate the elasticity for a given product - the log-log transform helps get the % vs % coefficients
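
A minimal sketch of the log-log trick on simulated data (a true elasticity of -2 is baked in; the variable names are made up):

```r
# Simulate demand from a known elasticity of -2 (an elastic good)
set.seed(1)
price  <- runif(100, min = 1, max = 10)
demand <- exp(5 - 2 * log(price) + rnorm(100, sd = 0.1))

# In a log-log regression, the price coefficient IS the elasticity
elast_fit <- lm(log(demand) ~ log(price))
coef(elast_fit)[["log(price)"]]  # close to -2
```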

Seasonal/holiday/promotional effects:

  • Seasonal products are common - can be bought any time of the year, though certain seasons have higher demand (holidays are a common example)
  • Promotions are attempts by companies to influence demand
  • Linear regression can help determine relationships between demand and many other factors
    • If an xts vector has been created for key dates, can merge(train, holiday, fill=0) and the holiday column will be 0 wherever there is no match to holiday

Forecasting with regression:

  • Forecasting with time series is straightforward due to the lag nature of the models - yesterday forecasts today, today forecasts tomorrow, and so on
  • Forecasting with regression can be more tricky, particularly since we need the future inputs (such as price) in order to predict the future demand
    • Even when there are contractually fixed prices, promotions can effectively create a de facto price change anyway
  • Need to have the same column names in the test/validation dataset as were used in the modeling
    • Then, can use predict(myModel, myData)
    • May need to exponentiate in case the data are currently on the log scale rather than the absolute scale
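
A small sketch of that predict-then-exponentiate workflow on simulated data (the column-name match between training and validation frames is the point):

```r
# Train a log-log model on simulated data
set.seed(2)
train <- data.frame(log_price = log(runif(50, 1, 5)))
train$log_sales <- 8 - 1.5 * train$log_price + rnorm(50, sd = 0.2)
log_fit <- lm(log_sales ~ log_price, data = train)

# The validation frame must use the SAME column names as the model formula
valid <- data.frame(log_price = log(c(2, 3, 4)))

# Predict on the log scale, then exponentiate back to unit sales
pred_sales <- exp(predict(log_fit, valid))
```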

Example code includes:

bev_xts_train <- bev_xts[index(bev_xts) < "2017-01-01"]
bev_xts_valid <- bev_xts[index(bev_xts) >= "2017-01-01"]

# Save the prices of each product
l_MET_hi_p <- log(as.vector(bev_xts_train[, "MET.hi.p"]))

# Save as a data frame
MET_hi_train <- data.frame(as.vector(log(MET_hi[index(MET_hi) < "2017-01-01"])), l_MET_hi_p)
colnames(MET_hi_train) <- c("log_sales", "log_price")

# Calculate the regression
model_MET_hi <- lm(log_sales ~ log_price, data = MET_hi_train)


# Plot the product's sales
plot(MET_hi)

# Plot the product's price
MET_hi_p <- bev_xts_train[, "MET.hi.p"]
plot(MET_hi_p)

# Create date indices for New Year's week
n.dates <- as.Date(c("2014-12-28", "2015-12-27", "2016-12-25"))

# Create xts objects for New Year's
newyear <- as.xts(rep(1, 3), order.by = n.dates)

# Create sequence of dates for merging
dates_train <- seq(as.Date("2014-01-19"), length = 154, by = "weeks")

# Merge training dates into New Year's object
newyear <- merge(newyear, dates_train, fill = 0)


# Add newyear variable to your data frame
MET_hi_train <- data.frame(MET_hi_train, newyear=as.vector(newyear))

# Build regressions for the product
model_MET_hi_full <- lm(log_sales ~ log_price + newyear, data = MET_hi_train)


# Subset the validation prices #
l_MET_hi_p_valid <- log(as.vector(bev_xts_valid[, "MET.hi.p"]))

# Create a validation data frame #
MET_hi_valid <- data.frame(l_MET_hi_p_valid)
colnames(MET_hi_valid) <- "log_price"


# Predict the log of sales for your high end product
pred_MET_hi <- predict(model_MET_hi, MET_hi_valid)

# Convert predictions out of log scale
pred_MET_hi <- exp(pred_MET_hi)


# Convert to an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
pred_MET_hi_xts <- xts(pred_MET_hi, order.by = dates_valid)

# Plot the forecast
plot(pred_MET_hi_xts)

# Calculate and print the MAPE
MET_hi_v <- bev_xts_valid[,"MET.hi"]

MAPE <- 100*mean(abs((pred_MET_hi_xts - MET_hi_v)/MET_hi_v))
print(MAPE)
## [1] 29.57455

Chapter 3 - Blending Regression with Time Series

Residuals from regression model:

  • The residuals from the regression models can be used for further modeling - see if the residuals are related over time, and model them with time series if so
  • Need to start by gathering the residuals and then converting them to an XTS object - explore for patterns in this XTS object

Forecasting residuals:

  • When the residuals are related across time, we can use time series to model the residuals - basically, patterns to the errors provide an opportunity for further modeling
  • Can use auto.arima() on the residuals data, to see what the best ARIMA model for the residuals is
    • Can then forecast the residuals in to the future using forecast(myModel, h=) # h being the time periods to predict forward

Transfer functions and ensembling:

  • Techniques for combining forecasts - single model (transfer function) or averaging of models (ensembling)
  • Demand can be based on both regression (modeling external factors) and time series (residuals)
  • Ensembling is a combination (blend) of the forecasts, with simple averaging being the simplest approach
    • Basically, build a stand-alone time series model and a stand-alone regression model
    • The ensemble forecast can be better or worse than any of the stand-alone models
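
The simple-average ensemble is one line; a toy sketch with two made-up forecast vectors:

```r
# Stand-alone time series and regression forecasts for the same periods
ts_fc  <- c(105, 110,  98)
reg_fc <- c( 95, 120, 102)

# Simple averaging - the most basic ensemble
en_fc <- 0.5 * (ts_fc + reg_fc)
en_fc  # 100 115 100
```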

Example code includes:

# Calculate the residuals from the model
MET_hi_full_res <- resid(model_MET_hi_full)

# Convert the residuals to an xts object
MET_hi_full_res <- xts(MET_hi_full_res, order.by = dates_train)


# Plot the histogram of the residuals
hist(MET_hi_full_res)

# Plot the residuals over time
plot(MET_hi_full_res)

# Build an ARIMA model on the residuals: MET_hi_arima
MET_hi_arima <- auto.arima(MET_hi_full_res)

# Look at a summary of the model
summary(MET_hi_arima)
## Series: MET_hi_full_res 
## ARIMA(3,0,1) with zero mean 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1
##       1.2150  -0.1758  -0.2945  -0.4675
## s.e.  0.1578   0.1885   0.0901   0.1574
## 
## sigma^2 estimated as 0.03905:  log likelihood=32.51
## AIC=-55.02   AICc=-54.61   BIC=-39.83
## 
## Training set error measures:
##                         ME      RMSE       MAE      MPE     MAPE      MASE
## Training set -0.0004563225 0.1950202 0.1456822 -46.5269 224.0129 0.5251787
##                     ACF1
## Training set 0.001241359
# Forecast 22 weeks with your model: for_MET_hi_arima
for_MET_hi_arima <- forecast(MET_hi_arima, h=22)

# Print first 10 observations
head(for_MET_hi_arima$mean, n = 10)
## Time Series:
## Start = 1079 
## End = 1142 
## Frequency = 0.142857142857143 
##  [1] -0.05952097 -0.12319548 -0.10839523 -0.09251312 -0.05706575
##  [6] -0.02114739  0.01158452  0.03459993  0.04623106  0.04667681
# Convert your forecasts into an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_hi_arima <- xts(for_MET_hi_arima$mean, order.by = dates_valid)

# Plot the forecast
plot(for_MET_hi_arima)

# Convert your residual forecast to the exponential version
for_MET_hi_arima <- exp(for_MET_hi_arima)

# Multiply your forecasts together!
for_MET_hi_final <- for_MET_hi_arima * pred_MET_hi_xts


# Plot the final forecast - don't touch the options!
plot(for_MET_hi_final, ylim = c(1000, 4300))

# Overlay the validation data set
lines(MET_hi_v, col = "blue")

# Calculate the MAE
MAE <- mean(abs(for_MET_hi_final - MET_hi_v))
print(MAE)
## [1] 474.7013
# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_hi_final - MET_hi_v)/MET_hi_v)
print(MAPE)
## [1] 28.44671
# Build an ARIMA model using the auto.arima function
MET_hi_model_arima <- auto.arima(MET_hi)

# Forecast the ARIMA model
for_MET_hi <- forecast(MET_hi_model_arima, h = length(MET_hi_v))

# Save the forecast as an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_hi_xts <- xts(for_MET_hi$mean, order.by = dates_valid)

# Calculate the MAPE of the forecast
MAPE <- 100 * mean(abs(for_MET_hi_xts - MET_hi_v) / MET_hi_v)
print(MAPE)
## [1] 36.95411
# Ensemble the two forecasts together
for_MET_hi_en <- 0.5 * (for_MET_hi_xts + pred_MET_hi_xts)

# Calculate the MAE and MAPE
MAE <- mean(abs(for_MET_hi_en - MET_hi_v))
print(MAE)
## [1] 533.8911
MAPE <- 100 * mean(abs(for_MET_hi_en - MET_hi_v) / MET_hi_v)
print(MAPE)
## [1] 32.28549

Chapter 4 - Hierarchical Forecasting

Bottom-Up Hierarchical Forecasting:

  • The hierarchical data structuring can be an advantage in forecasting, provided that the data has a natural hierarchy
  • The sum of all the lower-level forecasts should equal the higher-level forecasts
    • Bottom-up: Forecast at the lowest level and aggregate (easiest, but requires the largest number of forecasts)
    • Top-down: Forecast at the top level and then apply downwards
    • Middle-out: Forecast at the middle levels and then apply both upwards and downwards

Top-Down Hierarchical Forecasting:

  • The top-down forecasting process is typically quicker but less accurate than the bottom-up forecasting process
  • Two techniques available for top-down reconciliation
    • Average of historical proportions - mean percentage that each component contributes to the total (calculated by sub-component such as week)
    • Proportion of historical averages - mean percentage that each component contributes to the total (calculated by aggregate)
  • Reconciled forecasts at lower levels are typically less accurate than the direct forecast of the lower levels
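
The two reconciliation techniques differ only in where the averaging happens; a toy sketch with made-up weekly sales for one product and its regional total:

```r
# Made-up weekly sales for the "hi" product and the regional total
hi    <- c(30, 40, 35, 45)
total <- c(100, 120, 85, 105)

# Average of historical proportions: average the weekly shares
prop_ahp <- mean(hi / total)

# Proportion of historical averages: ratio of the average levels
prop_pha <- mean(hi) / mean(total)

# Either proportion then scales a top-level forecast (say, 110) downward
c(ahp = prop_ahp * 110, pha = prop_pha * 110)
```

The two proportions agree only when the product's share of the total is stable over time.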

Middle-Out Hierarchical Forecasting:

  • Bottom-up forecasting is higher quality but more time-consuming than top-down forecasting
  • The middle-out forecasting method is a sometimes successful blend of the methods, getting decent accuracy at a lesser time commitment

Wrap up:

  • Using time series to forecast demand forward
  • Incorporating external factors using linear regression
  • Blending time series and regression approaches
  • Top-down, bottom-up, middle-out approaches to aggregation and forecasting at various levels (hierarchical)
  • Can extend by looking at cross-elasticities (impact of competitor pricing)
  • Can better forecast proportions using time series analysis
  • Additional demand forecasting models include neural networks, exponential smoothing, etc.

Example code includes:

# Build a time series model 
MET_sp_model_arima <- auto.arima(MET_sp)

# Forecast the time series model for 22 periods
for_MET_sp <- forecast(MET_sp_model_arima, h=22)

# Create an xts object
for_MET_sp_xts <- xts(for_MET_sp$mean, order.by=dates_valid)

MET_sp_v <- MET_sp["2017"]

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_sp_xts - MET_sp_v) / MET_sp_v)
print(MAPE)
## [1] 6.772393
MET_sp_train <- bev_xts_train %>%
    transform(log_sales = log(MET.sp), log_price=log(MET.sp.p))
MET_sp_train <- MET_sp_train[, c("log_sales", "log_price")]
MET_sp_train$newyear <- 0
MET_sp_train$valentine <- 0
MET_sp_train$christmas <- 0
MET_sp_train$mother <- 0

MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-12-28", "2015-12-27", "2016-12-25")), "newyear"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-02-09", "2015-02-08", "2016-02-07")), "valentine"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-12-21", "2015-12-20", "2016-12-18")), "christmas"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-05-04", "2015-05-03", "2016-05-01")), "mother"] <- 1


# Build a regression model
model_MET_sp <- lm(log_sales ~ log_price + newyear + valentine + christmas + mother, data = MET_sp_train)


MET_sp_valid <- as.data.frame(bev_xts_valid) %>%
    mutate(log_sales = log(MET.sp), log_price=log(MET.sp.p)) %>%
    select("log_sales", "log_price")
MET_sp_valid$newyear <- 0
MET_sp_valid$valentine <- 0
MET_sp_valid$christmas <- 0
MET_sp_valid$mother <- 0  

MET_sp_valid[7, "valentine"] <- 1
MET_sp_valid[19, "mother"] <- 1
MET_sp_valid$log_sales <- NULL


# Forecast the regression model using the predict function 
pred_MET_sp <- predict(model_MET_sp, MET_sp_valid)

# Exponentiate your predictions and create an xts object
pred_MET_sp <- exp(pred_MET_sp)
pred_MET_sp_xts <- xts(pred_MET_sp, order.by = dates_valid)

# Calculate MAPE
MAPE <- 100*mean(abs((pred_MET_sp_xts - MET_sp_v)/MET_sp_v))
print(MAPE)
## [1] 6.55473
# Ensemble the two forecasts
for_MET_sp_en <- 0.5 * (for_MET_sp_xts + pred_MET_sp_xts)

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_sp_en - MET_sp_v) / MET_sp_v)
print(MAPE)
## [1] 6.08091
# Copy over pred_MET_lo_xts
pred_MET_lo_xts <- xts(c(2960.6, 2974.1, 2943.2, 2948.6, 2915.6, 2736.4, 2953.9, 3199.4, 2934, 2898.7, 3027.7, 3165.9, 3073.1, 2842.7, 2928.7, 3070.2, 2982.2, 3018, 3031.9, 2879.4, 2993.2, 2974.1), order.by=dates_valid)


# Calculate the metropolitan regional sales forecast
for_MET_total <- pred_MET_hi_xts + for_MET_sp_en + pred_MET_lo_xts

# Calculate a validation data set 
MET_t_v <- bev_xts_valid[,"MET.hi"] + bev_xts_valid[,"MET.lo"] + bev_xts_valid[,"MET.sp"]

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_total - MET_t_v) / MET_t_v)
print(MAPE)
## [1] 10.61441
# Create the MET_total data
MET_total <- xts(data.frame(MET.hi=c(5942, 5600, 5541, 6892, 5586, 5943, 6329, 6693, 6938, 6138, 6361, 6378, 5423, 5097, 4937, 5496, 6870, 6626, 6356, 5657, 6577, 7202, 7381, 7404, 7204, 6667, 6153, 6035, 5633, 5283, 5178, 4758, 5058, 5254, 5954, 6166, 6247, 6304, 7202, 6662, 6814, 6174, 5412, 5380, 5674, 6472, 6912, 7404, 8614, 8849, 7174, 6489, 7174, 6555, 6402, 7671, 5012, 4790, 5075, 5238, 5615, 6113, 7706, 7811, 7898, 7232, 6585, 5870, 7084, 5125, 5330, 5553, 6349, 6195, 6271, 5851, 5333, 5854, 5609, 5649, 6051, 6409, 5786, 5190, 5085, 4949, 5151, 5147, 5426, 5509, 6956, 7870, 8224, 6685, 6153, 5802, 5244, 5162, 5036, 5025, 8378, 8944, 7109, 7605, 7846, 7598, 8012, 9551, 6102, 5366, 4932, 4962, 5392, 6194, 7239, 7621, 7460, 7097, 6596, 5848, 8306, 5344, 5848, 6341, 7364, 7269, 7053, 6682, 6971, 7521, 7063, 6298, 6003, 5227, 5047, 4877, 4851, 4628, 4516, 4442, 4935, 5181, 5431, 5866, 5919, 5704, 5957, 6019, 5962, 6021, 5880, 5674, 7439, 7415)),
                 order.by=dates_train
                 )

# Build a regional time series model
MET_t_model_arima <- auto.arima(MET_total)

# Calculate a 2017 forecast for 22 periods
for_MET_t <- forecast(MET_t_model_arima, h=22)

# Make an xts object from your forecast
for_MET_t_xts <- xts(for_MET_t$mean, order.by=dates_valid)

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_t_xts - MET_t_v) / MET_t_v)
print(MAPE)
## [1] 17.10332
# Calculate the average historical proportions
prop_hi <- mean(MET_hi/MET_total)
prop_lo <- mean(MET_lo/MET_total)
prop_sp <- mean(MET_sp/MET_total)

# Distribute out your forecast to each product
for_prop_hi <- prop_hi*for_MET_t_xts
for_prop_lo <- prop_lo*for_MET_t_xts
for_prop_sp <- prop_sp*for_MET_t_xts

# Calculate the MAPE's for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi - MET_hi_v) / MET_hi_v)
print(MAPE_hi)
## [1] 38.7318
MET_lo_v <- bev_xts_valid[,"MET.lo"]
MAPE_lo <- 100 * mean(abs(for_prop_lo - MET_lo_v) / MET_lo_v)
print(MAPE_lo)
## [1] 10.70649
MAPE_sp <- 100 * mean(abs(for_prop_sp - MET_sp_v) / MET_sp_v)
print(MAPE_sp)
## [1] 6.232888
# Calculate the average historical proportions
prop_hi_2 <- mean(MET_hi) / mean(MET_total)
prop_lo_2 <- mean(MET_lo) / mean(MET_total)
prop_sp_2 <- mean(MET_sp) / mean(MET_total)

# Distribute out your forecast to each product
for_prop_hi_2 <- prop_hi_2 * for_MET_t_xts
for_prop_lo_2 <- prop_lo_2 * for_MET_t_xts
for_prop_sp_2 <- prop_sp_2 * for_MET_t_xts

# Calculate the MAPE's for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi_2 - MET_hi_v) / MET_hi_v)
print(MAPE_hi)
## [1] 38.33559
MAPE_lo <- 100 * mean(abs(for_prop_lo_2 - MET_lo_v) / MET_lo_v)
print(MAPE_lo)
## [1] 8.450784
MAPE_sp <- 100 * mean(abs(for_prop_sp_2 - MET_sp_v) / MET_sp_v)
print(MAPE_sp)
## [1] 6.517045
SEC_total <- xts(data.frame(SEC.hi=c(700, 775, 789, 863, 765, 759, 757, 747, 746, 709, 749, 786, 796, 726, 727, 723, 778, 755, 739, 740, 723, 695, 727, 707, 725, 684, 667, 698, 727, 722, 748, 695, 742, 739, 715, 724, 686, 671, 688, 682, 710, 700, 672, 680, 695, 780, 751, 693, 809, 881, 703, 712, 768, 796, 808, 904, 641, 662, 693, 725, 719, 736, 715, 722, 732, 745, 689, 705, 811, 739, 744, 700, 745, 735, 732, 722, 721, 732, 750, 714, 752, 677, 731, 674, 720, 675, 741, 722, 715, 719, 649, 697, 743, 733, 772, 698, 690, 734, 713, 644, 788, 833, 749, 731, 670, 675, 675, 993, 773, 751, 697, 677, 750, 723, 780, 763, 721, 701, 704, 684, 985, 791, 731, 714, 704, 694, 685, 652, 708, 754, 747, 705, 711, 699, 712, 745, 706, 665, 666, 692, 676, 696, 689, 697, 689, 717, 697, 708, 660, 707, 715, 680, 922, 888)), order.by=dates_train
                 )

# Build a time series model for the region
SEC_t_model_arima <- auto.arima(SEC_total)

# Forecast the time series model
for_SEC_t <- forecast(SEC_t_model_arima, h=22)

# Make into an xts object
for_SEC_t_xts <- xts(for_SEC_t$mean, order.by=dates_valid)

SEC_t_v <- bev_xts_valid$SEC.hi + bev_xts_valid$SEC.lo
# Calculate the MAPE
MAPE <- 100 * mean(abs(for_SEC_t_xts - SEC_t_v) / SEC_t_v)
print(MAPE)
## [1] 4.742324
SEC_hi <- bev_xts_train[, "SEC.hi"]
SEC_lo <- bev_xts_train[, "SEC.lo"]
SEC_hi_v <- bev_xts_valid[, "SEC.hi"]
SEC_lo_v <- bev_xts_valid[, "SEC.lo"]

# Calculate the average of historical proportions
prop_hi <- mean(SEC_hi / SEC_total)
prop_lo <- mean(SEC_lo / SEC_total)

# Distribute the forecast
for_prop_hi <- prop_hi * for_SEC_t_xts
for_prop_lo <- prop_lo * for_SEC_t_xts

# Calculate a MAPE for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi - SEC_hi_v) / SEC_hi_v)
print(MAPE_hi)
## [1] 7.988508
MAPE_lo <- 100 * mean(abs(for_prop_lo - SEC_lo_v) / SEC_lo_v)
print(MAPE_lo)
## [1] 5.202529
# Copy over for_M_t_xts data
for_M_t_xts <- xts(c(2207, 2021, 2010, 2052, 2075, 2074, 2065, 2058, 2056, 2055, 2053, 2052, 2050, 2049, 2048, 2047, 2046, 2045, 2044, 2043, 2043, 2042), order.by=dates_valid)

# Calculate the state sales forecast: for_state
for_state = for_SEC_t_xts + for_MET_t_xts + for_M_t_xts

# See the forecasts
for_state
##                [,1]
## 2017-01-01 9996.689
## 2017-01-08 9525.915
## 2017-01-15 9342.760
## 2017-01-22 9269.321
## 2017-01-29 9214.912
## 2017-02-05 9162.005
## 2017-02-12 9118.199
## 2017-02-19 9087.859
## 2017-02-26 9070.209
## 2017-03-05 9058.715
## 2017-03-12 9049.677
## 2017-03-19 9043.959
## 2017-03-26 9038.794
## 2017-04-02 9035.673
## 2017-04-09 9033.250
## 2017-04-16 9031.296
## 2017-04-23 9029.656
## 2017-04-30 9028.227
## 2017-05-07 9026.939
## 2017-05-14 9025.746
## 2017-05-21 9025.617
## 2017-05-28 9024.530

HR Analytics in R: Exploring Employee Data

Chapter 1 - Identifying the Best Recruiting Source

Introduction - Ben Teusch, HR Analytics Consultant:

  • HR analytics has many other names - people analytics, workforce analytics, etc.
  • Identify groups for comparison - high vs. low performers, groups with high vs. low turnover, etc.
    • Exploratory analysis and statistics for each group, including plots of key differences
  • Course is outlined as a series of case studies, with one case per chapter

Recruiting and quality of hire:

  • Where are the best hires coming from, and how can you get more of them?
    • Defining quality of hire is challenging - some mix of productivity, satisfaction, retention, performance reviews, etc.
    • Attrition can be defined as the mean of a 1, 0 vector of “did the person leave in the time period T”
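
That attrition definition is just the mean of an indicator vector; a tiny sketch with made-up data:

```r
# 1 = employee left during period T, 0 = stayed
left <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Attrition rate is the mean of the 0/1 indicator
attrition_rate <- mean(left)
attrition_rate  # 0.3
```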

Visualizing recruiting data:

  • Helpful for communicating findings to decision makers
  • geom_col() in ggplot2 will make a bar chart, with the y aesthetic being the bar height
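
A minimal geom_col() sketch (made-up summary data; assumes ggplot2 is available, as it is via the tidyverse load at the top of this document):

```r
library(ggplot2)

# Hypothetical average quality-of-hire by recruiting source
avg_quality <- data.frame(
  source  = c("Applied Online", "Campus", "Referral", "Search Firm"),
  quality = c(1.05, 0.91, 1.21, 0.88)
)

# geom_col() draws bars whose height comes from the y aesthetic
p <- ggplot(avg_quality, aes(x = source, y = quality)) +
  geom_col()
p
```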

Example code includes:

# Import the recruitment data
recruitment <- readr::read_csv("./RInputFiles/recruitment_data.csv")
## Parsed with column specification:
## cols(
##   attrition = col_double(),
##   performance_rating = col_double(),
##   sales_quota_pct = col_double(),
##   recruiting_source = col_character()
## )
# Look at the first few rows of the dataset
head(recruitment)
## # A tibble: 6 x 4
##   attrition performance_rating sales_quota_pct recruiting_source
##       <dbl>              <dbl>           <dbl> <chr>            
## 1         1                  3           1.09  Applied Online   
## 2         0                  3           2.39  <NA>             
## 3         1                  2           0.498 Campus           
## 4         0                  2           2.51  <NA>             
## 5         0                  3           1.42  Applied Online   
## 6         1                  3           0.548 Referral
# Get an overview of the recruitment data
summary(recruitment)
##    attrition     performance_rating sales_quota_pct   recruiting_source 
##  Min.   :0.000   Min.   :1.000      Min.   :-0.7108   Length:446        
##  1st Qu.:0.000   1st Qu.:2.000      1st Qu.: 0.5844   Class :character  
##  Median :0.000   Median :3.000      Median : 1.0701   Mode  :character  
##  Mean   :0.213   Mean   :2.895      Mean   : 1.0826                     
##  3rd Qu.:0.000   3rd Qu.:3.000      3rd Qu.: 1.5325                     
##  Max.   :1.000   Max.   :5.000      Max.   : 3.6667
# See which recruiting sources the company has been using
recruitment %>% 
  count(recruiting_source)
## # A tibble: 5 x 2
##   recruiting_source     n
##   <chr>             <int>
## 1 <NA>                205
## 2 Applied Online      130
## 3 Campus               56
## 4 Referral             45
## 5 Search Firm          10
# Find the average sales quota attainment for each recruiting source
avg_sales <- recruitment %>% 
  group_by(recruiting_source) %>% 
  summarize(avg_sales_quota_pct=mean(sales_quota_pct))

# Display the result
avg_sales
## # A tibble: 5 x 2
##   recruiting_source avg_sales_quota_pct
##   <chr>                           <dbl>
## 1 <NA>                            1.17 
## 2 Applied Online                  1.06 
## 3 Campus                          0.908
## 4 Referral                        1.02 
## 5 Search Firm                     0.887
# Find the average attrition for the sales team, by recruiting source, sorted from lowest attrition rate to highest
avg_attrition <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(attrition_rate=mean(attrition)) %>%
  arrange(attrition_rate)

# Display the result
avg_attrition
## # A tibble: 5 x 2
##   recruiting_source attrition_rate
##   <chr>                      <dbl>
## 1 <NA>                       0.132
## 2 Applied Online             0.246
## 3 Campus                     0.286
## 4 Referral                   0.333
## 5 Search Firm                0.5
# Plot the bar chart
avg_sales %>% ggplot(aes(x=recruiting_source, y=avg_sales_quota_pct)) + geom_col()

# Plot the bar chart
avg_attrition %>% ggplot(aes(x=recruiting_source, y=attrition_rate)) + geom_col()


Chapter 2 - What is driving low employee engagement?

Analyzing employee engagement:

  • Gallup defines engaged employees as those who are involved in, enthusiastic about, and committed to their workplace
  • Survey data are available in the example case study
    • Will use both mutate() and ifelse()
    • ifelse() is needed for vectors of length > 1 since it works in a vectorized manner (and is thus OK inside a mutate() call)
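
The vectorization point can be seen with a short sketch (x is a made-up vector): ifelse() evaluates every element, which is why it is safe inside mutate(), while plain if() looks only at a single condition.

```r
x <- c(1, 3, 5)

# Vectorized: one result per element
ifelse(x > 2, "high", "low")
# [1] "low"  "high" "high"

# Plain if() is not vectorized: a condition of length > 1 is an error in
# current R versions (older versions silently used only the first element)
```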

Visualizing the engagement data:

  • Multiple attributes in a single place can make for a more compelling report
  • The tidyr package is part of the tidyverse, and helps arrange the data properly for plotting
    • tidyr::gather(columns, key = "key", value = "value") will be used in this example - it pulls the data from the columns down into the rows
    • ggplot(survey_gathered, aes(x = key, y = value, fill = department)) + geom_col(position = "dodge")
    • ggplot(survey_gathered, aes(x = key, y = value, fill = department)) + geom_col(position = "dodge") + facet_wrap(~ key, scales = "free")

Are differences meaningful?

  • Can use significance testing to assess the likelihood (p-value) that the second sample could have come from the same population as the first sample
    • This course will use the t-test (continuous variables) and the chi-squared test (categorical variables)
    • t.test(tenure ~ is_manager, data = survey)
    • chisq.test(survey$left_company, survey$is_manager) # no data = argument is available in this function

Example code includes:

# Import the data
survey <- readr::read_csv("./RInputFiles/survey_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   engagement = col_double(),
##   salary = col_double(),
##   vacation_days_taken = col_double()
## )
# Get an overview of the data
summary(survey)
##   employee_id      department          engagement       salary      
##  Min.   :   1.0   Length:1470        Min.   :1.00   Min.   : 45530  
##  1st Qu.: 491.2   Class :character   1st Qu.:3.00   1st Qu.: 59407  
##  Median :1020.5   Mode  :character   Median :3.00   Median : 70481  
##  Mean   :1024.9                      Mean   :3.05   Mean   : 74162  
##  3rd Qu.:1555.8                      3rd Qu.:4.00   3rd Qu.: 84763  
##  Max.   :2068.0                      Max.   :5.00   Max.   :164073  
##  vacation_days_taken
##  Min.   : 0.00      
##  1st Qu.: 6.00      
##  Median :10.00      
##  Mean   :11.27      
##  3rd Qu.:16.00      
##  Max.   :38.00
# Examine the counts of the department variable
survey %>% count(department)
## # A tibble: 3 x 2
##   department      n
##   <chr>       <int>
## 1 Engineering   961
## 2 Finance        63
## 3 Sales         446
# Output the average engagement score for each department, sorted
survey %>%
  group_by(department) %>%
  summarize(avg_engagement=mean(engagement)) %>%
  arrange(avg_engagement)
## # A tibble: 3 x 2
##   department  avg_engagement
##   <chr>                <dbl>
## 1 Sales                 2.81
## 2 Engineering           3.15
## 3 Finance               3.24
# Create the disengaged variable and assign the result to survey
survey_disengaged <- survey %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0)) 

survey_disengaged
## # A tibble: 1,470 x 6
##    employee_id department  engagement  salary vacation_days_tak~ disengaged
##          <dbl> <chr>            <dbl>   <dbl>              <dbl>      <dbl>
##  1           1 Sales                3 103264.                  7          0
##  2           2 Engineering          3  80709.                 12          0
##  3           4 Engineering          3  60737.                 12          0
##  4           5 Engineering          3  99116.                  7          0
##  5           7 Engineering          3  51022.                 18          0
##  6           8 Engineering          3  98400.                  9          0
##  7          10 Engineering          3  57106.                 18          0
##  8          11 Engineering          1  55065.                  4          1
##  9          12 Engineering          4  77158.                 12          0
## 10          13 Engineering          2  48365.                 14          1
## # ... with 1,460 more rows
# Summarize the three variables by department
survey_summary <- survey_disengaged %>%
  group_by(department) %>%
  summarize(pct_disengaged=mean(disengaged), 
            avg_salary=mean(salary), 
            avg_vacation_taken=mean(vacation_days_taken)
            )

survey_summary
## # A tibble: 3 x 4
##   department  pct_disengaged avg_salary avg_vacation_taken
##   <chr>                <dbl>      <dbl>              <dbl>
## 1 Engineering          0.206     73576.              12.2 
## 2 Finance              0.190     76652.              11.5 
## 3 Sales                0.330     75074.               9.22
# Gather data for plotting
survey_gathered <- survey_summary %>% 
  gather(key = "measure", value = "value",
         pct_disengaged, avg_salary, avg_vacation_taken)

# Create three bar charts
ggplot(survey_gathered, aes(x=measure, y=value, fill=department)) +
  geom_col(position="dodge") + 
  facet_wrap(~ measure, scales="free")

# Add the in_sales variable
survey_sales <- survey %>%
  mutate(in_sales = ifelse(department == "Sales", "Sales", "Other"), 
         disengaged = ifelse(engagement < 3, 1L, 0L)
         )

# Test the hypothesis using survey_sales
chisq.test(survey_sales$disengaged, survey_sales$in_sales)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survey_sales$disengaged and survey_sales$in_sales
## X-squared = 25.524, df = 1, p-value = 4.368e-07
t.test(disengaged ~ in_sales, data=survey_sales)
## 
##  Welch Two Sample t-test
## 
## data:  disengaged by in_sales
## t = -4.862, df = 743.16, p-value = 1.419e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.17479596 -0.07424062
## sample estimates:
## mean in group Other mean in group Sales 
##           0.2050781           0.3295964
# Test the hypothesis using the survey_sales data
t.test(vacation_days_taken ~ in_sales, data = survey_sales)
## 
##  Welch Two Sample t-test
## 
## data:  vacation_days_taken by in_sales
## t = 8.1549, df = 1022.9, p-value = 1.016e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.229473 3.642409
## sample estimates:
## mean in group Other mean in group Sales 
##           12.160156            9.224215

Chapter 3 - Are new hires getting paid too much?

Paying new hires fairly:

  • Sometimes, current employees get paid less than new employees, which can drive low engagement and turnover
  • Case study will have a simulated pay dataset available for analysis
  • Can use broom::tidy() to return the outputs in a nicely formatted data frame
    • chisq.test(survey$in_sales, survey$disengaged) %>% tidy()

Omitted variable bias:

  • Key assumption of the tests is that the groups are the same, with the exception of the variables being tested
  • Omitted variable bias occurs when both 1) the omitted variable is correlated with the dependent variable, and 2) the omitted variable is correlated with an explanatory variable
    • Omitted variables are often known as confounders
    • Plotting can help to identify the issue, particularly with a stacked (to 100%) bar chart
    • pay %>% ggplot(aes(x = new_hire, fill = department)) + geom_bar(position = "fill")
    • geom_bar() computes bar height from the count of rows at each x value by default, in contrast to geom_col(), which maps bar height to a y aesthetic
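
A small simulation with made-up numbers shows both conditions at work: job_level (the confounder) drives both salary and the chance of being a new hire, so a naive comparison attributes a pay difference to new_hire even though none exists.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 1000

# Confounder: higher job levels pay more AND are hired externally more often
job_level <- rbinom(n, 1, 0.5)
new_hire  <- rbinom(n, 1, 0.2 + 0.4 * job_level)
salary    <- 50000 + 20000 * job_level + rnorm(n, sd = 5000)  # no true new_hire effect

# Naive model overstates the new-hire effect; adding the confounder removes it
coef(lm(salary ~ new_hire))["new_hire"]              # large and spurious
coef(lm(salary ~ new_hire + job_level))["new_hire"]  # near zero
```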

Linear regression helps to test the multivariate impacts of variables:

  • lm(salary ~ new_hire, data = pay) %>% tidy() # single explanatory variable
  • lm(salary ~ new_hire + department, data = pay) %>% tidy() # multiple explanatory variables
  • lm(salary ~ new_hire + department, data = pay) %>% summary() # more detailed summary of the linear regression

Example code includes:

# Import the data
pay <- readr::read_csv("./RInputFiles/fair_pay_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   salary = col_double(),
##   new_hire = col_character(),
##   job_level = col_character()
## )
# Get an overview of the data
summary(pay)
##   employee_id      department            salary         new_hire        
##  Min.   :   1.0   Length:1470        Min.   : 43820   Length:1470       
##  1st Qu.: 491.2   Class :character   1st Qu.: 59378   Class :character  
##  Median :1020.5   Mode  :character   Median : 70425   Mode  :character  
##  Mean   :1024.9                      Mean   : 74142                     
##  3rd Qu.:1555.8                      3rd Qu.: 84809                     
##  Max.   :2068.0                      Max.   :164073                     
##   job_level        
##  Length:1470       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Check average salary of new hires and non-new hires
pay %>% 
  group_by(new_hire) %>%
  summarize(avg_salary=mean(salary))
## # A tibble: 2 x 2
##   new_hire avg_salary
##   <chr>         <dbl>
## 1 No           73425.
## 2 Yes          76074.
# Perform the correct statistical test
t.test(salary ~ new_hire, data = pay)
## 
##  Welch Two Sample t-test
## 
## data:  salary by new_hire
## t = -2.3437, df = 685.16, p-value = 0.01938
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4869.4242  -429.9199
## sample estimates:
##  mean in group No mean in group Yes 
##          73424.60          76074.28
t.test(salary ~ new_hire, data = pay) %>%
  broom::tidy()
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   -2650.    73425.    76074.     -2.34  0.0194      685.   -4869.
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>
# Create a stacked bar chart
pay %>%
  ggplot(aes(x=new_hire, fill=job_level)) + 
  geom_bar(position="fill")

# Calculate the average salary for each group of interest
pay_grouped <- pay %>% 
  group_by(new_hire, job_level) %>% 
  summarize(avg_salary = mean(salary))
  
# Graph the results using facet_wrap()  
pay_grouped %>%
  ggplot(aes(x=new_hire, y=avg_salary)) + 
  geom_col() + 
  facet_wrap(~ job_level)

# Filter the data to include only hourly employees
pay_filter <- pay %>%
  filter(job_level == "Hourly")

# Test the difference in pay
t.test(salary ~ new_hire, data=pay_filter) %>%
  broom::tidy()
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   -1107.    63966.    65073.     -1.75  0.0807      500.   -2349.
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>
# Run the simple regression
model_simple <- lm(salary ~ new_hire, data = pay)

# Display the summary of model_simple
model_simple %>% 
  summary()
## 
## Call:
## lm(formula = salary ~ new_hire, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32255 -14466  -3681  10740  87998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73424.6      577.2 127.200   <2e-16 ***
## new_hireYes   2649.7     1109.4   2.388    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18900 on 1468 degrees of freedom
## Multiple R-squared:  0.003871,   Adjusted R-squared:  0.003193 
## F-statistic: 5.705 on 1 and 1468 DF,  p-value: 0.01704
# Display a tidy summary
model_simple %>% 
  broom::tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   73425.      577.    127.    0     
## 2 new_hireYes    2650.     1109.      2.39  0.0170
# Run the multiple regression
model_multiple <- lm(salary ~ new_hire + job_level, data = pay)

# Display the summary of model_multiple
model_multiple %>% 
  summary()
## 
## Call:
## lm(formula = salary ~ new_hire + job_level, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21013  -7091   -425   6771  44322 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        64049.3      308.3 207.722   <2e-16 ***
## new_hireYes          782.7      524.8   1.491    0.136    
## job_levelManager   54918.8      915.3  60.001   <2e-16 ***
## job_levelSalaried  26865.6      567.2  47.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8930 on 1466 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.7775 
## F-statistic:  1712 on 3 and 1466 DF,  p-value: < 2.2e-16
# Display a tidy summary
model_multiple %>% 
  broom::tidy()
## # A tibble: 4 x 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         64049.      308.    208.   0.       
## 2 new_hireYes           783.      525.      1.49 1.36e-  1
## 3 job_levelManager    54919.      915.     60.0  0.       
## 4 job_levelSalaried   26866.      567.     47.4  7.39e-298

Chapter 4 - Are performance ratings being given consistently?

Joining HR data:

  • Employee data tend to be stored in different locations, requiring joins (merges) prior to running analyses
    • dplyr::left_join(hr_data, bonus_pay_data, by = "employee_id")
    • All employees in hr_data will be kept, even if there is no matching record in bonus_pay_data
    • Employee ID (or similar) is by far the best way to join data - names tend to be non-unique and can differ in different systems

Performance ratings and fairness:

  • Performance ratings are inherently subjective and thus prone to bias
  • Unconscious bias arises from the brain’s heuristics, and may include preferences for or against members of various groups (reflected in hiring, promotion, etc.)

Logistic regression is especially helpful for modeling binary response variables:

  • glm(high_performer ~ salary, data = hr, family = "binomial") %>% tidy()
  • glm(high_performer ~ salary + department, data = hr, family = "binomial") %>% tidy()
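
Logistic regression coefficients are on the log-odds scale; exponentiating a coefficient gives an odds ratio, which is usually easier to communicate. Using the genderMale estimate of roughly 0.566 from the simple model in this chapter:

```r
b <- 0.566  # genderMale estimate from the simple logistic model

# Odds ratio: men's odds of being rated a high performer are about 1.76x
# women's, before controlling for job level
exp(b)
```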

Example code includes:

# Import the data
hr_data <- readr::read_csv("./RInputFiles/hr_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   job_level = col_character(),
##   gender = col_character()
## )
performance_data <- readr::read_csv("./RInputFiles/performance_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   rating = col_double()
## )
# Examine the datasets
summary(hr_data)
##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0
summary(performance_data)
##   employee_id         rating    
##  Min.   :   1.0   Min.   :1.00  
##  1st Qu.: 491.2   1st Qu.:2.00  
##  Median :1020.5   Median :3.00  
##  Mean   :1024.9   Mean   :2.83  
##  3rd Qu.:1555.8   3rd Qu.:4.00  
##  Max.   :2068.0   Max.   :5.00
# Join the two tables
joined_data <- left_join(hr_data, performance_data, by = "employee_id")

# Examine the result
summary(joined_data)
##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0                                                           
##      rating    
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :3.00  
##  Mean   :2.83  
##  3rd Qu.:4.00  
##  Max.   :5.00
# Check whether the average performance rating differs by gender 
joined_data %>%
  group_by(gender) %>%
  summarize(avg_rating = mean(rating))
## # A tibble: 2 x 2
##   gender avg_rating
##   <chr>       <dbl>
## 1 Female       2.75
## 2 Male         2.92
# Add the high_performer column
performance <- joined_data %>%  
  mutate(high_performer = ifelse(rating >= 4, 1, 0))

# Test whether one gender is more likely to be a high performer
chisq.test(performance$gender, performance$high_performer)   
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  performance$gender and performance$high_performer
## X-squared = 22.229, df = 1, p-value = 2.42e-06
# Do the same test, and tidy the output
chisq.test(performance$gender, performance$high_performer) %>% broom::tidy()
## # A tibble: 1 x 4
##   statistic    p.value parameter method                                    
##       <dbl>      <dbl>     <int> <chr>                                     
## 1      22.2 0.00000242         1 Pearson's Chi-squared test with Yates' co~
# Visualize the distribution of high_performer by gender
performance %>%
  ggplot(aes(x=gender, fill=factor(high_performer))) + 
  geom_bar(position="fill")

# Visualize the distribution of all ratings by gender
performance %>%
  ggplot(aes(x=gender, fill=factor(rating))) + 
  geom_bar(position="fill")

# Visualize the distribution of job_level by gender
performance %>%
  ggplot(aes(x = gender, fill = job_level)) +
  geom_bar(position = "fill")

# Test whether men and women have different job level distributions
chisq.test(performance$gender, performance$job_level) 
## 
##  Pearson's Chi-squared test
## 
## data:  performance$gender and performance$job_level
## X-squared = 44.506, df = 2, p-value = 2.166e-10
# Visualize the distribution of high_performer by gender, faceted by job level
performance %>%
  ggplot(aes(x = gender, fill = factor(high_performer))) +
  geom_bar(position = "fill") + 
  facet_wrap(~ job_level)

# Run a simple logistic regression
logistic_simple <- glm(high_performer ~ gender, family = "binomial", data = performance) 

# View the result with summary()
logistic_simple %>%
  summary()
## 
## Call:
## glm(formula = high_performer ~ gender, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8871  -0.8871  -0.6957   1.4986   1.7535  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.29540    0.08813 -14.699  < 2e-16 ***
## genderMale   0.56596    0.11921   4.748 2.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1709.0  on 1469  degrees of freedom
## Residual deviance: 1686.1  on 1468  degrees of freedom
## AIC: 1690.1
## 
## Number of Fisher Scoring iterations: 4
# View a tidy version of the result
logistic_simple %>%
  broom::tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.30     0.0881    -14.7  6.58e-49
## 2 genderMale     0.566    0.119       4.75 2.06e- 6
# Run a multiple logistic regression
logistic_multiple <- glm(high_performer ~ gender + job_level, family = "binomial", data = performance)

# View the result with summary() or tidy()
logistic_multiple %>% broom::tidy()
## # A tibble: 4 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)         -1.69      0.103    -16.5  2.74e-61
## 2 genderMale           0.319     0.129      2.47 1.34e- 2
## 3 job_levelManager     2.74      0.251     10.9  1.01e-27
## 4 job_levelSalaried    1.10      0.141      7.82 5.17e-15

Chapter 5 - Improving employee safety with data

Employee safety - looking at accident rates and drivers:

  • Requires joining data on multiple variables
    • joined_data <- left_join(hr_data, safety_data, by = c("year", "employee_id"))
    • joined_data %>% filter(is.na(accident_time)) # use is.na() instead of == NA, since any comparison with NA returns NA
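
The is.na() point matters because comparing against NA with == never yields TRUE or FALSE (a sketch with a made-up vector):

```r
x <- c("Mild", NA, "Severe")

x == NA    # [1] NA NA NA -- any comparison with NA is NA
is.na(x)   # [1] FALSE  TRUE FALSE
```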

Focusing on the location of interest:

  • May want to run comparisons of the same location over time
  • May want to assess differences by locations to see if they may be explanatory variables

Explaining the increase in accidents:

  • Can use multiple regression to help test for explanatory variables that impact the accident rate

Wrap up:

  • Key tools from the Tidyverse (ggplot2, broom, dplyr, etc.) to assess HR data
  • Analytics usage within HR, including differences between HR data and other kinds of data
  • Can apply additional data science techniques on HR data

Example code includes:

# Import the data 
hr_data <- readr::read_csv("./RInputFiles/hr_data_2.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   location = col_character(),
##   overtime_hours = col_double()
## )
accident_data <- readr::read_csv("./RInputFiles/accident_data.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   accident_type = col_character()
## )
# Create hr_joined with left_join() and mutate()
hr_joined <- left_join(hr_data, accident_data, by=c("year", "employee_id")) %>% 
  mutate(had_accident=ifelse(is.na(accident_type), 0, 1))
  
hr_joined
## # A tibble: 2,940 x 6
##     year employee_id location    overtime_hours accident_type had_accident
##    <dbl>       <dbl> <chr>                <dbl> <chr>                <dbl>
##  1  2016           1 Northwood               14 <NA>                     0
##  2  2017           1 Northwood                8 Mild                     1
##  3  2016           2 East Valley              8 <NA>                     0
##  4  2017           2 East Valley             11 <NA>                     0
##  5  2016           4 East Valley              4 <NA>                     0
##  6  2017           4 East Valley              2 Mild                     1
##  7  2016           5 West River               0 <NA>                     0
##  8  2017           5 West River              17 <NA>                     0
##  9  2016           7 West River              21 <NA>                     0
## 10  2017           7 West River               9 <NA>                     0
## # ... with 2,930 more rows
# Find accident rate for each year
hr_joined %>% 
  group_by(year) %>% 
  summarize(accident_rate = mean(had_accident))
## # A tibble: 2 x 2
##    year accident_rate
##   <dbl>         <dbl>
## 1  2016        0.0850
## 2  2017        0.120
# Test difference in accident rate between years
chisq.test(hr_joined$year, hr_joined$had_accident)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hr_joined$year and hr_joined$had_accident
## X-squared = 9.5986, df = 1, p-value = 0.001947
# Which location had the highest accident rate?
hr_joined %>%
  group_by(location) %>%
  summarize(accident_rate=mean(had_accident)) %>%
  arrange(-accident_rate)
## # A tibble: 4 x 2
##   location    accident_rate
##   <chr>               <dbl>
## 1 East Valley        0.128 
## 2 Southfield         0.103 
## 3 West River         0.0961
## 4 Northwood          0.0831
# Compare annual accident rates by location
accident_rates <- hr_joined %>% 
  group_by(location, year) %>% 
  summarize(accident_rate = mean(had_accident))
  
accident_rates
## # A tibble: 8 x 3
## # Groups:   location [4]
##   location     year accident_rate
##   <chr>       <dbl>         <dbl>
## 1 East Valley  2016        0.113 
## 2 East Valley  2017        0.143 
## 3 Northwood    2016        0.0644
## 4 Northwood    2017        0.102 
## 5 Southfield   2016        0.0764
## 6 Southfield   2017        0.130 
## 7 West River   2016        0.0824
## 8 West River   2017        0.110
# Graph it
accident_rates %>% 
  ggplot(aes(factor(year), accident_rate)) +
  geom_col() +
  facet_wrap(~location)

# Filter out the other locations
southfield <- hr_joined %>% 
  filter(location == "Southfield")

# Find the average overtime hours worked by year
southfield %>%
  group_by(year) %>% 
  summarize(average_overtime_hours = mean(overtime_hours))
## # A tibble: 2 x 2
##    year average_overtime_hours
##   <dbl>                  <dbl>
## 1  2016                   8.67
## 2  2017                   9.97
# Test difference in Southfield's overtime hours between years
t.test(overtime_hours ~ year, data=southfield) 
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -1.6043, df = 595.46, p-value = 0.1092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.904043  0.292747
## sample estimates:
## mean in group 2016 mean in group 2017 
##           8.667774           9.973422
# Import the survey data
survey_data <- readr::read_csv("./RInputFiles/survey_data_2.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   engagement = col_double()
## )
# Create the safety dataset
safety <- left_join(hr_joined, survey_data, by=c("employee_id", "year")) %>%
  mutate(disengaged=ifelse(engagement <= 2, 1, 0), year=factor(year))


# Visualize the difference in % disengaged by year in Southfield
safety %>% 
    filter(location=="Southfield") %>%
    ggplot(aes(x = year, fill = factor(disengaged))) +
    geom_bar(position = "fill")

# Test whether one year had significantly more disengaged employees
southSafety <- safety %>% 
    filter(location=="Southfield")
chisq.test(southSafety$disengaged, southSafety$year)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  southSafety$disengaged and southSafety$year
## X-squared = 7.1906, df = 1, p-value = 0.007329
# Filter out Southfield
other_locs <- safety %>% 
  filter(location != "Southfield")

# Test whether one year had significantly more overtime hours worked
t.test(overtime_hours ~ year, data = other_locs) 
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -0.48267, df = 2320.3, p-value = 0.6294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9961022  0.6026035
## sample estimates:
## mean in group 2016 mean in group 2017 
##           9.278015           9.474765
# Test whether one year had significantly more disengaged employees
chisq.test(other_locs$year, other_locs$disengaged)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  other_locs$year and other_locs$disengaged
## X-squared = 0.0023091, df = 1, p-value = 0.9617
# Use multiple regression to test the impact of year and disengaged on accident rate in Southfield
regression <- glm(had_accident ~ year + disengaged, family = "binomial", data = southSafety)

# Examine the output
regression %>% broom::tidy()
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -2.92      0.250    -11.7  1.74e-31
## 2 year2017       0.440     0.285      1.55 1.22e- 1
## 3 disengaged     1.44      0.278      5.19 2.13e- 7

Supervised Learning in R: Case Studies

Chapter 1 - Cars Data

Making predictions using machine learning:

  • Course focuses on applied predictive modeling skills, using regression and classification as well as EDA
    • Regression tends to be for predicting continuous, numeric variables
    • Classification tends to be for predicting categorical variables
  • Case studies include 1) fuel efficiency, 2) Stack Overflow developer survey, 3) voter turnout, and 4) ages of nuns
  • The fuel efficiency data is stored in cars2018 and is based on data from the US Department of Energy
    • Variable names with spaces can be handled by surrounding them with backticks
    • The tidyverse includes tibble, readr, ggplot2, dplyr, tidyr, purrr, etc. - all can be loaded at once using library(tidyverse)

Getting started with caret:

  • The caret package is useful for predictive modeling - full process including the test/train split for the raw dataset
    • in_train <- createDataPartition(cars_vars$Aspiration, p = 0.8, list = FALSE) # will stratify on the Aspiration variable
    • training <- cars_vars[in_train,]
    • testing <- cars_vars[-in_train,]
  • Can then train the model using only the training dataset
    • fit_lm <- train(log(MPG) ~ ., method = "lm", data = training, trControl = trainControl(method = "none"))
    • Can then use the yardstick package to assess the quality of the model

Sampling data:

  • Bootstrap resampling means sampling with replacement, and then fitting on the resampled dataset (run multiple times)
    • cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training, trControl = trainControl(method = "boot")) # defaults to 25 resamples
    • Can both visualize the models and assess the model statistically

Example code includes:

cars2018 <- readr::read_csv("./RInputFiles/cars2018.csv")
## Parsed with column specification:
## cols(
##   Model = col_character(),
##   `Model Index` = col_integer(),
##   Displacement = col_double(),
##   Cylinders = col_integer(),
##   Gears = col_integer(),
##   Transmission = col_character(),
##   MPG = col_integer(),
##   Aspiration = col_character(),
##   `Lockup Torque Converter` = col_character(),
##   Drive = col_character(),
##   `Max Ethanol` = col_integer(),
##   `Recommended Fuel` = col_character(),
##   `Intake Valves Per Cyl` = col_integer(),
##   `Exhaust Valves Per Cyl` = col_integer(),
##   `Fuel injection` = col_character()
## )
str(cars2018, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1144 obs. of  15 variables:
##  $ Model                  : chr  "Acura NSX" "ALFA ROMEO 4C" "Audi R8 AWD" "Audi R8 RWD" ...
##  $ Model Index            : int  57 410 65 71 66 72 46 488 38 278 ...
##  $ Displacement           : num  3.5 1.8 5.2 5.2 5.2 5.2 2 3 8 6.2 ...
##  $ Cylinders              : int  6 4 10 10 10 10 4 6 16 8 ...
##  $ Gears                  : int  9 6 7 7 7 7 6 7 7 8 ...
##  $ Transmission           : chr  "Manual" "Manual" "Manual" "Manual" ...
##  $ MPG                    : int  21 28 17 18 17 18 26 20 11 18 ...
##  $ Aspiration             : chr  "Turbocharged/Supercharged" "Turbocharged/Supercharged" "Naturally Aspirated" "Naturally Aspirated" ...
##  $ Lockup Torque Converter: chr  "Y" "Y" "Y" "Y" ...
##  $ Drive                  : chr  "All Wheel Drive" "2-Wheel Drive, Rear" "All Wheel Drive" "2-Wheel Drive, Rear" ...
##  $ Max Ethanol            : int  10 10 15 15 15 15 15 10 15 10 ...
##  $ Recommended Fuel       : chr  "Premium Unleaded Required" "Premium Unleaded Required" "Premium Unleaded Recommended" "Premium Unleaded Recommended" ...
##  $ Intake Valves Per Cyl  : int  2 2 2 2 2 2 2 2 2 1 ...
##  $ Exhaust Valves Per Cyl : int  2 2 2 2 2 2 2 2 2 1 ...
##  $ Fuel injection         : chr  "Direct ignition" "Direct ignition" "Direct ignition" "Direct ignition" ...
summary(cars2018)
##     Model            Model Index     Displacement     Cylinders     
##  Length:1144        Min.   :  1.0   Min.   :1.000   Min.   : 3.000  
##  Class :character   1st Qu.: 36.0   1st Qu.:2.000   1st Qu.: 4.000  
##  Mode  :character   Median :108.0   Median :3.000   Median : 6.000  
##                     Mean   :201.3   Mean   :3.087   Mean   : 5.564  
##                     3rd Qu.:323.8   3rd Qu.:3.600   3rd Qu.: 6.000  
##                     Max.   :821.0   Max.   :8.000   Max.   :16.000  
##      Gears        Transmission            MPG        Aspiration       
##  Min.   : 1.000   Length:1144        Min.   :11.0   Length:1144       
##  1st Qu.: 6.000   Class :character   1st Qu.:19.0   Class :character  
##  Median : 7.000   Mode  :character   Median :23.0   Mode  :character  
##  Mean   : 6.935                      Mean   :23.2                     
##  3rd Qu.: 8.000                      3rd Qu.:26.0                     
##  Max.   :10.000                      Max.   :58.0                     
##  Lockup Torque Converter    Drive            Max Ethanol   
##  Length:1144             Length:1144        Min.   :10.00  
##  Class :character        Class :character   1st Qu.:10.00  
##  Mode  :character        Mode  :character   Median :10.00  
##                                             Mean   :15.29  
##                                             3rd Qu.:15.00  
##                                             Max.   :85.00  
##  Recommended Fuel   Intake Valves Per Cyl Exhaust Valves Per Cyl
##  Length:1144        Min.   :1.000         Min.   :1.000         
##  Class :character   1st Qu.:2.000         1st Qu.:2.000         
##  Mode  :character   Median :2.000         Median :2.000         
##                     Mean   :1.926         Mean   :1.922         
##                     3rd Qu.:2.000         3rd Qu.:2.000         
##                     Max.   :2.000         Max.   :2.000         
##  Fuel injection    
##  Length:1144       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Print the cars2018 object
cars2018
## # A tibble: 1,144 x 15
##    Model     `Model Index` Displacement Cylinders Gears Transmission   MPG
##    <chr>             <int>        <dbl>     <int> <int> <chr>        <int>
##  1 Acura NSX            57         3.50         6     9 Manual          21
##  2 ALFA ROM~           410         1.80         4     6 Manual          28
##  3 Audi R8 ~            65         5.20        10     7 Manual          17
##  4 Audi R8 ~            71         5.20        10     7 Manual          18
##  5 Audi R8 ~            66         5.20        10     7 Manual          17
##  6 Audi R8 ~            72         5.20        10     7 Manual          18
##  7 Audi TT ~            46         2.00         4     6 Manual          26
##  8 BMW M4 D~           488         3.00         6     7 Manual          20
##  9 Bugatti ~            38         8.00        16     7 Manual          11
## 10 Chevrole~           278         6.20         8     8 Automatic       18
## # ... with 1,134 more rows, and 8 more variables: Aspiration <chr>,
## #   `Lockup Torque Converter` <chr>, Drive <chr>, `Max Ethanol` <int>,
## #   `Recommended Fuel` <chr>, `Intake Valves Per Cyl` <int>, `Exhaust
## #   Valves Per Cyl` <int>, `Fuel injection` <chr>
# Plot the histogram
ggplot(cars2018, aes(x = MPG)) +
    geom_histogram(bins = 25) +
    labs(y = "Number of cars",
         x = "Fuel efficiency (mpg)")

# Deselect the 2 columns to create cars_vars
cars_vars <- cars2018 %>%
    select(-Model, -`Model Index`)

# Fit a linear model
fit_all <- lm(MPG ~ ., data = cars_vars)

# Print the summary of the model
summary(fit_all)
## 
## Call:
## lm(formula = MPG ~ ., data = cars_vars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5261 -1.6473 -0.1096  1.3572 26.5045 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                    44.539519   1.176283
## Displacement                                   -3.786147   0.264845
## Cylinders                                       0.520284   0.161802
## Gears                                           0.157674   0.069984
## TransmissionCVT                                 4.877637   0.404051
## TransmissionManual                             -1.074608   0.366075
## AspirationTurbocharged/Supercharged            -2.190248   0.267559
## `Lockup Torque Converter`Y                     -2.624494   0.381252
## Drive2-Wheel Drive, Rear                       -2.676716   0.291044
## Drive4-Wheel Drive                             -3.397532   0.335147
## DriveAll Wheel Drive                           -2.941084   0.257174
## `Max Ethanol`                                  -0.007377   0.005898
## `Recommended Fuel`Premium Unleaded Required    -0.403935   0.262413
## `Recommended Fuel`Regular Unleaded Recommended -0.996343   0.272495
## `Intake Valves Per Cyl`                        -1.446107   1.620575
## `Exhaust Valves Per Cyl`                       -2.469747   1.547748
## `Fuel injection`Multipoint/sequential ignition -0.658428   0.243819
##                                                t value Pr(>|t|)    
## (Intercept)                                     37.865  < 2e-16 ***
## Displacement                                   -14.296  < 2e-16 ***
## Cylinders                                        3.216 0.001339 ** 
## Gears                                            2.253 0.024450 *  
## TransmissionCVT                                 12.072  < 2e-16 ***
## TransmissionManual                              -2.935 0.003398 ** 
## AspirationTurbocharged/Supercharged             -8.186 7.24e-16 ***
## `Lockup Torque Converter`Y                      -6.884 9.65e-12 ***
## Drive2-Wheel Drive, Rear                        -9.197  < 2e-16 ***
## Drive4-Wheel Drive                             -10.137  < 2e-16 ***
## DriveAll Wheel Drive                           -11.436  < 2e-16 ***
## `Max Ethanol`                                   -1.251 0.211265    
## `Recommended Fuel`Premium Unleaded Required     -1.539 0.124010    
## `Recommended Fuel`Regular Unleaded Recommended  -3.656 0.000268 ***
## `Intake Valves Per Cyl`                         -0.892 0.372400    
## `Exhaust Valves Per Cyl`                        -1.596 0.110835    
## `Fuel injection`Multipoint/sequential ignition  -2.700 0.007028 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.916 on 1127 degrees of freedom
## Multiple R-squared:  0.7314, Adjusted R-squared:  0.7276 
## F-statistic: 191.8 on 16 and 1127 DF,  p-value: < 2.2e-16
# Load caret
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Split the data into training and test sets
set.seed(1234)
in_train <- createDataPartition(cars_vars$Transmission, p = 0.8, list = FALSE)
training <- cars_vars[in_train, ]
testing <- cars_vars[-in_train, ]

# Train a linear regression model
fit_lm <- train(log(MPG) ~ ., method = "lm", data = training,
                trControl = trainControl(method = "none"))

# Print the model object
fit_lm
## Linear Regression 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: None
# Train a random forest model
fit_rf <- train(log(MPG) ~ ., method = "rf", data = training,
                trControl = trainControl(method = "none"))

# Print the model object
fit_rf
## Random Forest 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: None
# Create the new columns
results <- training %>%
    mutate(`Linear regression` = predict(fit_lm, training),
           `Random forest` = predict(fit_rf, training))

# Evaluate the performance (note: truth = MPG while the models predict log(MPG), so these metrics compare values on different scales)
yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.9 0.702
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.9 0.845
# Create the new columns
results <- testing %>%
    mutate(`Linear regression` = predict(fit_lm, testing),
           `Random forest` = predict(fit_rf, testing))

# Evaluate the performance
yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.799
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.880
# Fit the models with bootstrap resampling
cars_lm_bt <- train(log(MPG) ~ ., method = "lm", data = training,
                   trControl = trainControl(method = "boot"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training,
                   trControl = trainControl(method = "boot"))
                   
# Quick look at the models
cars_lm_bt
## Linear Regression 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1036278  0.7890514  0.07656104
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cars_rf_bt
## Random Forest 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    2    0.10015480  0.8205322  0.07299305
##    9    0.08758544  0.8466598  0.06129895
##   16    0.09100659  0.8360034  0.06313542
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
results <- testing %>%
    mutate(`Linear regression` = predict(cars_lm_bt, testing),
           `Random forest` = predict(cars_rf_bt, testing))

yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.799
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.903
results %>%
    gather(Method, Result, `Linear regression`:`Random forest`) %>%
    ggplot(aes(log(MPG), Result, color = Method)) +
    geom_point(size = 1.5, alpha = 0.5) +
    facet_wrap(~Method) +
    geom_abline(lty = 2, color = "gray50") +
    geom_smooth(method = "lm")


Chapter 2 - Stack Overflow Developer Data

Essential copying and pasting from Stack Overflow (largest and most trusted developer community):

  • Annual survey of developer perspectives on Stack Overflow - can be used for predictive modeling
  • Data is made available publicly at insights.stackoverflow.com/survey
  • Key question is “what makes a developer more likely to work remotely” (size of company, geography of employee, etc.)
    • Data are class imbalanced, with many more non-remote employees than remote employees
    • Best first step is the simplest model - logit, without any tricks
    • simple_glm <- stackoverflow %>% select(-Respondent) %>% glm(Remote ~ ., family = "binomial", data = .) # "Remote ~ ." means regress on all other variables; "data = ." passes along the piped dataset

Dealing with imbalanced data:

  • Class imbalance is a common problem that can negatively impact model performance
    • This dataset has roughly 10x as many non-remote as remote respondents, which can push models toward predicting non-remote in every case
  • One approach to class imbalance is upsampling: resampling the minority class with replacement until it matches the size of the majority class
    • Simple to implement, but with the risk of over-fitting
    • up_train <- upSample(x = select(training, -Remote), y = training$Remote, yname = "Remote") %>% as_tibble()
    • stack_glm <- train(Remote ~ ., method = "glm", family = "binomial", data = training, trControl = trainControl(method = "boot", sampling = "up"))

Predicting remote status:

  • Classification models can include logistic regression and random forests
    • stack_glm <- train(Remote ~ ., method = "glm", family = "binomial", data = training, trControl = trainControl(method = "boot", sampling = "up"))
    • stack_rf <- train(Remote ~ ., method = "rf", data = training, trControl = trainControl(method = "boot", sampling = "up"))
  • Classification models can be evaluated using the confusion matrix
    • confusionMatrix(predict(stack_glm, testing), testing$Remote)
    • yardstick::accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
    • yardstick::ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
    • yardstick::npv(testing_results, truth = Remote, estimate = `Logistic regression`)

Example code includes:

stackoverflow <- readr::read_csv("./RInputFiles/stackoverflow.csv")
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   Respondent = col_integer(),
##   Country = col_character(),
##   Salary = col_double(),
##   YearsCodedJob = col_integer(),
##   CompanySizeNumber = col_double(),
##   Remote = col_character(),
##   CareerSatisfaction = col_integer()
## )
## See spec(...) for full column specifications.
stackoverflow$Remote <- factor(stackoverflow$Remote, levels=c("Not remote", "Remote"))
str(stackoverflow, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6991 obs. of  22 variables:
##  $ Respondent                          : int  3 15 18 19 26 55 62 71 73 77 ...
##  $ Country                             : chr  "United Kingdom" "United Kingdom" "United States" "United States" ...
##  $ Salary                              : num  113750 100000 130000 82500 175000 ...
##  $ YearsCodedJob                       : int  20 20 20 3 16 4 1 1 20 20 ...
##  $ OpenSource                          : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ Hobby                               : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ CompanySizeNumber                   : num  10000 5000 1000 10000 10000 1000 5000 20 100 1000 ...
##  $ Remote                              : Factor w/ 2 levels "Not remote","Remote": 1 2 2 1 1 1 1 1 2 2 ...
##  $ CareerSatisfaction                  : int  8 8 9 5 7 9 5 8 8 10 ...
##  $ Data scientist                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Database administrator              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Desktop applications developer      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Developer with stats/math background: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ DevOps                              : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Embedded developer                  : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ Graphic designer                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Graphics programming                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Machine learning specialist         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Mobile developer                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Quality assurance engineer          : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Systems administrator               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Web developer                       : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
# Print stackoverflow
stackoverflow
## # A tibble: 6,991 x 22
##    Respondent Country        Salary YearsCodedJob OpenSource Hobby
##         <int> <chr>           <dbl>         <int> <lgl>      <lgl>
##  1          3 United Kingdom 113750            20 T          T    
##  2         15 United Kingdom 100000            20 F          T    
##  3         18 United States  130000            20 T          T    
##  4         19 United States   82500             3 F          T    
##  5         26 United States  175000            16 F          T    
##  6         55 Germany         64516             4 F          F    
##  7         62 India            6636             1 F          T    
##  8         71 United States   65000             1 F          T    
##  9         73 United States  120000            20 T          T    
## 10         77 United States   96283            20 T          T    
## # ... with 6,981 more rows, and 16 more variables:
## #   CompanySizeNumber <dbl>, Remote <fct>, CareerSatisfaction <int>, `Data
## #   scientist` <lgl>, `Database administrator` <lgl>, `Desktop
## #   applications developer` <lgl>, `Developer with stats/math
## #   background` <lgl>, DevOps <lgl>, `Embedded developer` <lgl>, `Graphic
## #   designer` <lgl>, `Graphics programming` <lgl>, `Machine learning
## #   specialist` <lgl>, `Mobile developer` <lgl>, `Quality assurance
## #   engineer` <lgl>, `Systems administrator` <lgl>, `Web developer` <lgl>
# First count for Remote
stackoverflow %>% 
    count(Remote, sort = TRUE)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  6273
## 2 Remote       718
# then count for Country
stackoverflow %>% 
    count(Country, sort = TRUE)
## # A tibble: 5 x 2
##   Country            n
##   <chr>          <int>
## 1 United States   3486
## 2 United Kingdom  1270
## 3 Germany          950
## 4 India            666
## 5 Canada           619
ggplot(stackoverflow, aes(x=Remote, y=YearsCodedJob)) +
    geom_boxplot() +
    labs(x = NULL,
         y = "Years of professional coding experience") 

# Build a simple logistic regression model
simple_glm <- stackoverflow %>%
        select(-Respondent) %>%
        glm(Remote ~ .,
            family = "binomial",
            data = .)

# Print the summary of the model
summary(simple_glm)
## 
## Call:
## glm(formula = Remote ~ ., family = "binomial", data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1942  -0.4971  -0.3824  -0.2867   2.9118  
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                -4.156e+00  2.929e-01 -14.187
## CountryGermany                             -2.034e-01  2.196e-01  -0.927
## CountryIndia                                9.574e-01  2.220e-01   4.312
## CountryUnited Kingdom                       5.599e-02  1.974e-01   0.284
## CountryUnited States                        5.990e-01  1.799e-01   3.330
## Salary                                      4.076e-06  1.589e-06   2.565
## YearsCodedJob                               7.133e-02  7.556e-03   9.440
## OpenSourceTRUE                              4.207e-01  8.555e-02   4.917
## HobbyTRUE                                   8.330e-03  9.827e-02   0.085
## CompanySizeNumber                          -6.104e-05  1.223e-05  -4.990
## CareerSatisfaction                          6.748e-02  2.664e-02   2.533
## `Data scientist`TRUE                       -1.186e-01  1.838e-01  -0.645
## `Database administrator`TRUE                2.763e-01  1.267e-01   2.181
## `Desktop applications developer`TRUE       -2.903e-01  9.842e-02  -2.950
## `Developer with stats/math background`TRUE  2.840e-02  1.359e-01   0.209
## DevOpsTRUE                                 -1.532e-01  1.292e-01  -1.185
## `Embedded developer`TRUE                   -2.777e-01  1.653e-01  -1.680
## `Graphic designer`TRUE                     -1.904e-01  2.725e-01  -0.699
## `Graphics programming`TRUE                  1.078e-01  2.312e-01   0.466
## `Machine learning specialist`TRUE          -2.289e-01  2.769e-01  -0.827
## `Mobile developer`TRUE                      2.170e-01  1.019e-01   2.130
## `Quality assurance engineer`TRUE           -2.826e-01  2.437e-01  -1.160
## `Systems administrator`TRUE                 1.462e-01  1.421e-01   1.029
## `Web developer`TRUE                         1.158e-01  9.993e-02   1.159
##                                            Pr(>|z|)    
## (Intercept)                                 < 2e-16 ***
## CountryGermany                             0.354161    
## CountryIndia                               1.62e-05 ***
## CountryUnited Kingdom                      0.776710    
## CountryUnited States                       0.000868 ***
## Salary                                     0.010314 *  
## YearsCodedJob                               < 2e-16 ***
## OpenSourceTRUE                             8.78e-07 ***
## HobbyTRUE                                  0.932444    
## CompanySizeNumber                          6.04e-07 ***
## CareerSatisfaction                         0.011323 *  
## `Data scientist`TRUE                       0.518709    
## `Database administrator`TRUE               0.029184 *  
## `Desktop applications developer`TRUE       0.003178 ** 
## `Developer with stats/math background`TRUE 0.834400    
## DevOpsTRUE                                 0.235833    
## `Embedded developer`TRUE                   0.093039 .  
## `Graphic designer`TRUE                     0.484596    
## `Graphics programming`TRUE                 0.641060    
## `Machine learning specialist`TRUE          0.408484    
## `Mobile developer`TRUE                     0.033194 *  
## `Quality assurance engineer`TRUE           0.246098    
## `Systems administrator`TRUE                0.303507    
## `Web developer`TRUE                        0.246655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4627.8  on 6990  degrees of freedom
## Residual deviance: 4268.8  on 6967  degrees of freedom
## AIC: 4316.8
## 
## Number of Fisher Scoring iterations: 5
stack_select <- stackoverflow %>%
    select(-Respondent)

# Split the data into training and testing sets
set.seed(1234)
in_train <- caret::createDataPartition(stack_select$Remote, p=0.8, list = FALSE)
training <- stack_select[in_train,]
testing <- stack_select[-in_train,]


up_train <- caret::upSample(x = select(training, -Remote), y = training$Remote, yname = "Remote") %>%
    as_tibble()

up_train %>%
    count(Remote)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  5019
## 2 Remote      5019
# Sub-sample to 5% of original
inUse <- sample(1:nrow(training), round(0.05*nrow(training)), replace=FALSE)
useTrain <- training[sort(inUse), ]

# Build a logistic regression model
stack_glm <- caret::train(Remote ~ ., method="glm", family="binomial", data = training, 
                          trControl = trainControl(method = "boot", sampling = "up")
                          )

# Print the model object 
stack_glm
## Generalized Linear Model 
## 
## 5594 samples
##   20 predictor
##    2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ... 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6568743  0.1279825
# Build a random forest model
stack_rf <- caret::train(Remote ~ ., method="rf", data = useTrain, 
                         trControl = trainControl(method = "boot", sampling="up")
                         )

# Print the model object
stack_rf
## Random Forest 
## 
## 280 samples
##  20 predictor
##   2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 280, 280, 280, 280, 280, 280, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa       
##    2    0.8626254   0.110738058
##   12    0.9038825  -0.002127159
##   23    0.8887612   0.035777206
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
# Confusion matrix for logistic regression model
caret::confusionMatrix(predict(stack_glm, testing), testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote        837     53
##   Remote            417     90
##                                           
##                Accuracy : 0.6636          
##                  95% CI : (0.6381, 0.6883)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1395          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6675          
##             Specificity : 0.6294          
##          Pos Pred Value : 0.9404          
##          Neg Pred Value : 0.1775          
##              Prevalence : 0.8976          
##          Detection Rate : 0.5991          
##    Detection Prevalence : 0.6371          
##       Balanced Accuracy : 0.6484          
##                                           
##        'Positive' Class : Not remote      
## 
# Confusion matrix for random forest model
caret::confusionMatrix(predict(stack_rf, testing), testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote       1207    125
##   Remote             47     18
##                                           
##                Accuracy : 0.8769          
##                  95% CI : (0.8585, 0.8937)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 0.9945          
##                                           
##                   Kappa : 0.1166          
##  Mcnemar's Test P-Value : 4.327e-09       
##                                           
##             Sensitivity : 0.9625          
##             Specificity : 0.1259          
##          Pos Pred Value : 0.9062          
##          Neg Pred Value : 0.2769          
##              Prevalence : 0.8976          
##          Detection Rate : 0.8640          
##    Detection Prevalence : 0.9535          
##       Balanced Accuracy : 0.5442          
##                                           
##        'Positive' Class : Not remote      
## 
# Predict values
testing_results <- testing %>%
    mutate(`Logistic regression` = predict(stack_glm, testing), `Random forest` = predict(stack_rf, testing))

## Calculate accuracy
yardstick::accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.6635648
yardstick::accuracy(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.876879
## Calculate positive predictive value
yardstick::ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.9404494
yardstick::ppv(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.9061562

Chapter 3 - Voting

Predicting voter turnout from survey data:

  • Survey data available from https://www.voterstudygroup.org/publications/2016-elections/data
    • Opinions about political and economic topics
    • Includes whether the voter turned out (voted), based on self-reporting, in the 2016 election
    • Data are coded as integers, requiring a data dictionary to map the questions and responses to what they mean

Vote 2016:

  • Exploratory data analysis will help with learning about the underlying dataset
    • There are differences on many of the individual dimensions between voters and non-voters
    • A good first step can be to start with the very simplest model, Dependent ~ .
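
As a sketch of that simplest model (assuming the voters data frame loaded in the example code below, with the identifier column dropped first):

```r
library(tidyverse)

# Fit the simplest reasonable model: logistic regression on all predictors
vote_glm <- voters %>%
    select(-case_identifier) %>%
    glm(turnout16_2016 ~ ., family = "binomial", data = .)

summary(vote_glm)
```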

Cross-validation is the process of sub-dividing the data into folds, with each fold used once as the validation set:

  • Allows for more accurate estimates of model performance on out-of-sample error
  • With k folds, CV fits the model k times, each time validating on a different held-out fold
    • Repeated CV runs the full k-fold process multiple times (this is particularly well suited to parallel processing)
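
A minimal sketch of repeated cross-validation with caret; the model call mirrors the approach used elsewhere in this document, and assumes the voters data frame from the example code below:

```r
library(caret)

# 10-fold CV: each fold is held out once as the validation set;
# repeats = 5 reruns the whole 10-fold process five times
cv_control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

vote_glm_cv <- train(turnout16_2016 ~ ., method = "glm", family = "binomial",
                     data = dplyr::select(voters, -case_identifier),
                     trControl = cv_control)
```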

Comparing model performance:

  • Random forest models tend to be more flexible and can fit the training data very closely, which raises the risk of overfitting and of poor predictions on the test set
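
One way to check for overfitting is to compare resampling performance against held-out test performance. This is a sketch with hypothetical fitted model objects (vote_glm_bt and vote_rf_bt, assumed trained on the same training data and resampling folds):

```r
library(caret)

# Collect the resampling results from two models fit on the same folds
model_comparison <- resamples(list(logistic = vote_glm_bt,
                                   forest = vote_rf_bt))
summary(model_comparison)

# A large gap between resampling accuracy and test-set accuracy
# is a warning sign of overfitting
confusionMatrix(predict(vote_rf_bt, testing), testing$turnout16_2016)
```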

Example code includes:

voters <- readr::read_csv("./RInputFiles/voters.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   turnout16_2016 = col_character()
## )
## See spec(...) for full column specifications.
voters$turnout16_2016 <- factor(voters$turnout16_2016, levels=c("Did not vote", "Voted"))
str(voters, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6692 obs. of  43 variables:
##  $ case_identifier     : int  779 2108 2597 4148 4460 5225 5903 6059 8048 13112 ...
##  $ turnout16_2016      : Factor w/ 2 levels "Did not vote",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RIGGED_SYSTEM_1_2016: int  3 2 2 1 3 3 3 2 4 2 ...
##  $ RIGGED_SYSTEM_2_2016: int  4 1 4 4 1 3 4 3 4 3 ...
##  $ RIGGED_SYSTEM_3_2016: int  1 3 1 1 3 2 1 3 1 1 ...
##  $ RIGGED_SYSTEM_4_2016: int  4 1 4 4 1 2 1 2 3 2 ...
##  $ RIGGED_SYSTEM_5_2016: int  3 3 1 2 3 2 2 1 3 2 ...
##  $ RIGGED_SYSTEM_6_2016: int  2 2 1 1 2 3 1 2 1 2 ...
##  $ track_2016          : int  2 2 1 1 2 2 1 2 2 2 ...
##  $ persfinretro_2016   : int  2 3 3 1 2 2 2 3 2 1 ...
##  $ econtrend_2016      : int  1 3 3 1 2 2 1 3 1 1 ...
##  $ Americatrend_2016   : int  1 1 1 3 3 1 2 3 2 1 ...
##  $ futuretrend_2016    : int  4 1 1 3 4 3 1 3 1 1 ...
##  $ wealth_2016         : int  2 1 2 2 1 2 2 1 2 2 ...
##  $ values_culture_2016 : int  2 3 3 3 3 2 3 3 1 3 ...
##  $ US_respect_2016     : int  2 3 1 1 2 2 2 3 3 3 ...
##  $ trustgovt_2016      : int  3 3 3 3 3 2 3 3 3 3 ...
##  $ trust_people_2016   : int  8 2 1 1 1 2 2 1 2 1 ...
##  $ helpful_people_2016 : int  1 1 2 1 1 1 2 2 1 2 ...
##  $ fair_people_2016    : int  8 2 1 1 1 2 2 1 2 1 ...
##  $ imiss_a_2016        : int  2 1 1 1 1 2 1 1 3 1 ...
##  $ imiss_b_2016        : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ imiss_c_2016        : int  1 2 2 3 1 2 2 1 4 2 ...
##  $ imiss_d_2016        : int  1 2 1 1 1 1 1 2 1 1 ...
##  $ imiss_e_2016        : int  1 1 3 1 1 3 1 2 1 1 ...
##  $ imiss_f_2016        : int  2 1 1 2 1 2 1 3 2 1 ...
##  $ imiss_g_2016        : int  1 4 3 3 3 1 3 4 2 2 ...
##  $ imiss_h_2016        : int  1 2 2 2 1 1 1 2 1 1 ...
##  $ imiss_i_2016        : int  2 2 4 4 2 1 1 3 2 1 ...
##  $ imiss_j_2016        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ imiss_k_2016        : int  1 2 1 1 2 1 1 4 2 1 ...
##  $ imiss_l_2016        : int  1 4 1 2 4 1 1 3 1 1 ...
##  $ imiss_m_2016        : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ imiss_n_2016        : int  1 2 1 1 1 1 1 2 2 1 ...
##  $ imiss_o_2016        : int  2 1 1 1 1 2 1 2 2 1 ...
##  $ imiss_p_2016        : int  2 1 2 3 1 3 1 1 4 1 ...
##  $ imiss_q_2016        : int  1 1 1 2 2 1 1 4 2 1 ...
##  $ imiss_r_2016        : int  2 1 1 2 1 2 1 2 4 2 ...
##  $ imiss_s_2016        : int  1 2 1 2 2 1 1 1 1 1 ...
##  $ imiss_t_2016        : int  1 1 3 3 1 1 3 4 1 1 ...
##  $ imiss_u_2016        : int  2 2 2 2 1 3 3 1 4 2 ...
##  $ imiss_x_2016        : int  1 3 1 2 1 1 1 4 1 1 ...
##  $ imiss_y_2016        : int  1 4 2 3 1 1 1 3 2 1 ...
# Print voters
voters
## # A tibble: 6,692 x 43
##    case_identifier turnout16_2016 RIGGED_SYSTEM_1_2016 RIGGED_SYSTEM_2_20~
##              <int> <fct>                         <int>               <int>
##  1             779 Voted                             3                   4
##  2            2108 Voted                             2                   1
##  3            2597 Voted                             2                   4
##  4            4148 Voted                             1                   4
##  5            4460 Voted                             3                   1
##  6            5225 Voted                             3                   3
##  7            5903 Voted                             3                   4
##  8            6059 Voted                             2                   3
##  9            8048 Voted                             4                   4
## 10           13112 Voted                             2                   3
## # ... with 6,682 more rows, and 39 more variables:
## #   RIGGED_SYSTEM_3_2016 <int>, RIGGED_SYSTEM_4_2016 <int>,
## #   RIGGED_SYSTEM_5_2016 <int>, RIGGED_SYSTEM_6_2016 <int>,
## #   track_2016 <int>, persfinretro_2016 <int>, econtrend_2016 <int>,
## #   Americatrend_2016 <int>, futuretrend_2016 <int>, wealth_2016 <int>,
## #   values_culture_2016 <int>, US_respect_2016 <int>,
## #   trustgovt_2016 <int>, trust_people_2016 <int>,
## #   helpful_people_2016 <int>, fair_people_2016 <int>, imiss_a_2016 <int>,
## #   imiss_b_2016 <int>, imiss_c_2016 <int>, imiss_d_2016 <int>,
## #   imiss_e_2016 <int>, imiss_f_2016 <int>, imiss_g_2016 <int>,
## #   imiss_h_2016 <int>, imiss_i_2016 <int>, imiss_j_2016 <int>,
## #   imiss_k_2016 <int>, imiss_l_2016 <int>, imiss_m_2016 <int>,
## #   imiss_n_2016 <int>, imiss_o_2016 <int>, imiss_p_2016 <int>,
## #   imiss_q_2016 <int>, imiss_r_2016 <int>, imiss_s_2016 <int>,
## #   imiss_t_2016 <int>, imiss_u_2016 <int>, imiss_x_2016 <int>,
## #   imiss_y_2016 <int>
# How many people voted?
voters %>%
    count(turnout16_2016)
## # A tibble: 2 x 2
##   turnout16_2016     n
##   <fct>          <int>
## 1 Did not vote     264
## 2 Voted           6428
# How do the responses on the survey vary with voting behavior?
voters %>%
    group_by(turnout16_2016) %>%
    summarize(`Elections don't matter` = mean(RIGGED_SYSTEM_1_2016 <= 2),
              `Economy is getting better` = mean(econtrend_2016 == 1),
              `Crime is very important` = mean(imiss_a_2016 == 2))
## # A tibble: 2 x 4
##   turnout16_2016 `Elections don't ~ `Economy is gettin~ `Crime is very im~
##   <fct>                       <dbl>               <dbl>              <dbl>
## 1 Did not vote                0.553               0.163              0.292
## 2 Voted                       0.341               0.289              0.342
# Visualize difference by voter turnout
voters %>%
    ggplot(aes(econtrend_2016, ..density.., fill = turnout16_2016)) +
    geom_histogram(alpha = 0.5, position = "identity", binwidth = 1) +
    labs(title = "Overall, is the economy getting better or worse?")

# Remove the case_identifier column
voters_select <- voters %>%
        select(-case_identifier)

# Build a simple logistic regression model
simple_glm <- glm(turnout16_2016 ~ .,  family = "binomial", 
                  data = voters_select)

# Print the summary                  
summary(simple_glm)
## 
## Call:
## glm(formula = turnout16_2016 ~ ., family = "binomial", data = voters_select)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2373   0.1651   0.2214   0.3004   1.7708  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.457036   0.732721   3.353 0.000799 ***
## RIGGED_SYSTEM_1_2016  0.236284   0.085081   2.777 0.005484 ** 
## RIGGED_SYSTEM_2_2016  0.064749   0.089208   0.726 0.467946    
## RIGGED_SYSTEM_3_2016  0.049357   0.107352   0.460 0.645680    
## RIGGED_SYSTEM_4_2016 -0.074694   0.087583  -0.853 0.393749    
## RIGGED_SYSTEM_5_2016  0.190252   0.096454   1.972 0.048556 *  
## RIGGED_SYSTEM_6_2016 -0.005881   0.101381  -0.058 0.953740    
## track_2016            0.241075   0.121467   1.985 0.047178 *  
## persfinretro_2016    -0.040229   0.106714  -0.377 0.706191    
## econtrend_2016       -0.295370   0.087224  -3.386 0.000708 ***
## Americatrend_2016    -0.105213   0.080845  -1.301 0.193116    
## futuretrend_2016      0.210568   0.071201   2.957 0.003103 ** 
## wealth_2016          -0.069405   0.026344  -2.635 0.008424 ** 
## values_culture_2016  -0.041402   0.038670  -1.071 0.284332    
## US_respect_2016      -0.068200   0.043785  -1.558 0.119322    
## trustgovt_2016        0.315354   0.166655   1.892 0.058456 .  
## trust_people_2016     0.040423   0.041518   0.974 0.330236    
## helpful_people_2016  -0.037513   0.035353  -1.061 0.288643    
## fair_people_2016     -0.017081   0.030170  -0.566 0.571294    
## imiss_a_2016          0.397121   0.138987   2.857 0.004273 ** 
## imiss_b_2016         -0.250803   0.155454  -1.613 0.106667    
## imiss_c_2016          0.017536   0.090647   0.193 0.846606    
## imiss_d_2016          0.043510   0.122118   0.356 0.721619    
## imiss_e_2016         -0.095552   0.078603  -1.216 0.224126    
## imiss_f_2016         -0.323280   0.105432  -3.066 0.002168 ** 
## imiss_g_2016         -0.332034   0.078673  -4.220 2.44e-05 ***
## imiss_h_2016         -0.157298   0.107111  -1.469 0.141954    
## imiss_i_2016          0.088695   0.091467   0.970 0.332196    
## imiss_j_2016          0.060271   0.138429   0.435 0.663280    
## imiss_k_2016         -0.181030   0.082726  -2.188 0.028646 *  
## imiss_l_2016          0.274689   0.106781   2.572 0.010098 *  
## imiss_m_2016         -0.124269   0.147888  -0.840 0.400746    
## imiss_n_2016         -0.441612   0.090040  -4.905 9.36e-07 ***
## imiss_o_2016          0.198635   0.143160   1.388 0.165286    
## imiss_p_2016          0.102987   0.105669   0.975 0.329751    
## imiss_q_2016          0.244567   0.119093   2.054 0.040017 *  
## imiss_r_2016          0.198839   0.121969   1.630 0.103050    
## imiss_s_2016         -0.067310   0.134465  -0.501 0.616666    
## imiss_t_2016         -0.116757   0.068143  -1.713 0.086639 .  
## imiss_u_2016          0.022902   0.097312   0.235 0.813939    
## imiss_x_2016         -0.017789   0.097349  -0.183 0.855003    
## imiss_y_2016          0.150205   0.094536   1.589 0.112092    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2224.3  on 6691  degrees of freedom
## Residual deviance: 2004.4  on 6650  degrees of freedom
## AIC: 2088.4
## 
## Number of Fisher Scoring iterations: 6
# Split data into training and testing sets
set.seed(1234)
in_train <- caret::createDataPartition(voters_select$turnout16_2016, p = 0.8, list = FALSE)
training <- voters_select[in_train, ]
testing <- voters_select[-in_train, ]


# Perform logistic regression with upsampling and no resampling
vote_glm_1 <- caret::train(turnout16_2016 ~ ., method = "glm", family = "binomial", data = training,
                           trControl = trainControl(method = "none", sampling = "up")
                           )

# Print vote_glm
vote_glm_1
## Generalized Linear Model 
## 
## 5355 samples
##   41 predictor
##    2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: None 
## Addtional sampling using up-sampling
useSmall <- sort(sample(1:nrow(training), round(0.1*nrow(training)), replace=FALSE))
trainSmall <- training[useSmall, ]

# Logistic regression
vote_glm <- caret::train(turnout16_2016 ~ ., method = "glm", family = "binomial", data = trainSmall,
                         trControl = trainControl(method = "repeatedcv", repeats = 2, sampling = "up")
                         )
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## (warning repeated for each of the resampled fits)
# Print vote_glm
vote_glm
## Generalized Linear Model 
## 
## 536 samples
##  41 predictor
##   2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 482, 482, 482, 483, 482, 483, ... 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.8713138  0.04298445
# Random forest
vote_rf <- caret::train(turnout16_2016 ~ ., method = "rf", data = trainSmall,
                        trControl = trainControl(method="repeatedcv", repeats=2, sampling = "up")
                        )

# Print vote_rf
vote_rf
## Random Forest 
## 
## 536 samples
##  41 predictor
##   2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 483, 482, 483, 482, 483, 483, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa       
##    2    0.9674179  -0.001265823
##   21    0.9627184  -0.006073829
##   41    0.9542628   0.019107234
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Confusion matrix for logistic regression model on training data
caret::confusionMatrix(predict(vote_glm, trainSmall), trainSmall$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           17    48
##   Voted                   0   471
##                                          
##                Accuracy : 0.9104         
##                  95% CI : (0.883, 0.9332)
##     No Information Rate : 0.9683         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3836         
##  Mcnemar's Test P-Value : 1.17e-11       
##                                          
##             Sensitivity : 1.00000        
##             Specificity : 0.90751        
##          Pos Pred Value : 0.26154        
##          Neg Pred Value : 1.00000        
##              Prevalence : 0.03172        
##          Detection Rate : 0.03172        
##    Detection Prevalence : 0.12127        
##       Balanced Accuracy : 0.95376        
##                                          
##        'Positive' Class : Did not vote   
## 
# Confusion matrix for random forest model on training data
caret::confusionMatrix(predict(vote_rf, trainSmall), trainSmall$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           17     0
##   Voted                   0   519
##                                       
##                Accuracy : 1           
##                  95% CI : (0.9931, 1) 
##     No Information Rate : 0.9683      
##     P-Value [Acc > NIR] : 3.143e-08   
##                                       
##                   Kappa : 1           
##  Mcnemar's Test P-Value : NA          
##                                       
##             Sensitivity : 1.00000     
##             Specificity : 1.00000     
##          Pos Pred Value : 1.00000     
##          Neg Pred Value : 1.00000     
##              Prevalence : 0.03172     
##          Detection Rate : 0.03172     
##    Detection Prevalence : 0.03172     
##       Balanced Accuracy : 1.00000     
##                                       
##        'Positive' Class : Did not vote
## 
# Confusion matrix for logistic regression model on testing data
caret::confusionMatrix(predict(vote_glm, testing), testing$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           14   166
##   Voted                  38  1119
##                                          
##                Accuracy : 0.8474         
##                  95% CI : (0.827, 0.8663)
##     No Information Rate : 0.9611         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0642         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.26923        
##             Specificity : 0.87082        
##          Pos Pred Value : 0.07778        
##          Neg Pred Value : 0.96716        
##              Prevalence : 0.03889        
##          Detection Rate : 0.01047        
##    Detection Prevalence : 0.13463        
##       Balanced Accuracy : 0.57002        
##                                          
##        'Positive' Class : Did not vote   
## 
# Confusion matrix for random forest model on testing data
caret::confusionMatrix(predict(vote_rf, testing), testing$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote            1     1
##   Voted                  51  1284
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9493, 0.9708)
##     No Information Rate : 0.9611          
##     P-Value [Acc > NIR] : 0.5368          
##                                           
##                   Kappa : 0.0343          
##  Mcnemar's Test P-Value : 1.083e-11       
##                                           
##             Sensitivity : 0.0192308       
##             Specificity : 0.9992218       
##          Pos Pred Value : 0.5000000       
##          Neg Pred Value : 0.9617978       
##              Prevalence : 0.0388930       
##          Detection Rate : 0.0007479       
##    Detection Prevalence : 0.0014959       
##       Balanced Accuracy : 0.5092263       
##                                           
##        'Positive' Class : Did not vote    
## 

Chapter 4 - Nuns

Catholic sisters survey from 1967 - https://curate.nd.edu/show/0r967368551 with codebook at https://curate.nd.edu/downloads/0v838051f6x

  • Responses from 130,000 sisters in ~400 congregations
  • There was significant change occurring during this time period, both in society at large and within the community of nuns
  • Age has been binned in groups of 10 years (recoded as the numeric top of each range, so 20 means 11-20, 30 means 21-30, and so on)
  • Historical dataset, centered in the context of nuns in 1967
  • A good first step is to tidy the data, making exploratory data analysis easier
    • sisters67 %>% select(-sister) %>% gather(key, value, -age)
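The age recoding described above (each age mapped to the top of its 10-year bin) can be sketched as follows. The survey data ships already binned, so this only illustrates the coding scheme:

```r
# Map a raw age to the top of its 10-year bin: 11-20 -> 20, 21-30 -> 30, ...
ages   <- c(11, 20, 21, 35, 67)
binned <- 10 * ceiling(ages / 10)
binned  # 20 20 30 40 70
```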

Exploratory data analysis with tidy data:

  • Easy to see levels of agreement (overall) using dplyr::count()
  • Agreement with specific questions by age
    • tidy_sisters %>% filter(key %in% paste0("v", 153:170)) %>% group_by(key, value) %>% summarise(age = mean(age)) %>% ggplot(aes(value, age, color = key)) + geom_line(alpha = 0.5, size = 1.5) + geom_point(size = 2) + facet_wrap(~key)
    • Can use the mix of responses to make estimates about the ages of the nuns
  • Data will be split into training, validation, and test sets
    • The validation set will be used for model selection
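A three-way split can be produced by partitioning twice: once to carve out the training set, and once to divide the remainder into validation and test halves. A minimal sketch on simulated data:

```r
library(caret)

set.seed(123)
df <- data.frame(y = rnorm(100), x = rnorm(100))

# 60% training; the remaining 40% is split in half
in_train <- createDataPartition(df$y, p = 0.6, list = FALSE)
training <- df[in_train, ]
rest     <- df[-in_train, ]

in_test    <- createDataPartition(rest$y, p = 0.5, list = FALSE)
testing    <- rest[in_test, ]
validation <- rest[-in_test, ]

c(nrow(training), nrow(validation), nrow(testing))
```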

Predicting age with supervised learning:

  • “rpart” - building a tree-based (CART) model
  • “xgbLinear” - extreme gradient boosting
  • “gbm” - gradient boosted ensembles
  • Validation datasets are useful for assessing hyper-parameters and model choices, leaving the test dataset pure for a final out-of-sample error estimate
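As one concrete example, a CART model (what method = "rpart" fits under the hood) can be trained directly with the rpart package. A sketch on the built-in iris data, predicting a numeric outcome as with the sisters' ages:

```r
library(rpart)

# Regression tree predicting a numeric outcome
fit  <- rpart(Sepal.Length ~ ., data = iris)
pred <- predict(fit, iris)
sqrt(mean((pred - iris$Sepal.Length)^2))  # training RMSE (optimistic)
```

Note that this computes training RMSE, which is optimistic; per the bullet above, validation-set error should drive model and hyper-parameter choices.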

Wrap up:

  • Train-Validation-Test to select the best models, tune the parameters, and estimate the out-of-sample error rates
  • Dealing with class imbalances; improving performance with resamples (bootstraps, cross-validation, etc.)
  • Hyper-parameter tuning can be valuable, but the time investment in other areas can often generate a greater return
  • Gradient boosting and random forests tend to perform very well, but there is always value in trying out multiple models
    • Start with EDA and begin with a very simple model

Example code includes:

sisters67 <- readr::read_csv("./RInputFiles/sisters.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
str(sisters67, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    19278 obs. of  67 variables:
##  $ age   : int  40 30 40 30 40 30 70 30 60 80 ...
##  $ sister: int  11545 16953 73323 75339 36303 95318 22474 114526 20707 91062 ...
##  $ v116  : int  5 4 2 4 4 2 4 4 4 5 ...
##  $ v117  : int  2 1 2 3 2 4 5 1 5 1 ...
##  $ v118  : int  2 4 5 3 3 5 5 4 5 2 ...
##  $ v119  : int  2 4 5 4 5 5 5 5 4 1 ...
##  $ v120  : int  4 1 3 3 1 1 5 1 5 2 ...
##  $ v121  : int  4 1 4 4 4 5 4 1 5 5 ...
##  $ v122  : int  4 1 2 2 4 1 1 2 2 1 ...
##  $ v123  : int  5 5 3 4 4 3 1 5 2 5 ...
##  $ v124  : int  1 1 5 2 3 1 5 3 5 4 ...
##  $ v125  : int  4 2 5 3 4 2 5 2 5 5 ...
##  $ v126  : int  2 1 1 3 1 1 5 1 5 2 ...
##  $ v127  : int  1 4 5 2 2 1 1 1 4 1 ...
##  $ v128  : int  2 1 4 3 4 4 5 2 5 3 ...
##  $ v129  : int  4 4 5 4 5 4 5 5 4 1 ...
##  $ v130  : int  2 4 4 3 3 1 5 1 5 4 ...
##  $ v131  : int  1 2 2 3 5 5 2 3 3 2 ...
##  $ v132  : int  5 5 5 4 5 2 2 5 4 5 ...
##  $ v133  : int  2 4 5 3 5 1 4 2 4 4 ...
##  $ v134  : int  2 4 4 3 4 4 1 4 4 2 ...
##  $ v135  : int  5 5 4 3 5 4 1 5 5 2 ...
##  $ v136  : int  1 4 4 2 4 4 1 4 4 2 ...
##  $ v137  : int  1 1 1 1 1 1 2 1 2 4 ...
##  $ v138  : int  2 1 3 1 3 1 4 1 2 1 ...
##  $ v139  : int  3 1 3 3 1 1 4 1 5 4 ...
##  $ v140  : int  1 2 1 2 4 4 5 2 5 2 ...
##  $ v141  : int  5 5 4 3 3 3 4 5 4 4 ...
##  $ v142  : int  1 1 2 2 2 1 2 1 4 3 ...
##  $ v143  : int  2 1 5 4 4 5 4 5 4 1 ...
##  $ v144  : int  1 2 1 2 1 1 3 1 4 2 ...
##  $ v145  : int  4 4 5 3 4 1 5 2 5 4 ...
##  $ v146  : int  4 4 5 4 5 5 4 5 2 4 ...
##  $ v147  : int  2 2 1 2 3 1 2 1 2 2 ...
##  $ v148  : int  1 1 4 1 1 4 4 1 5 1 ...
##  $ v149  : int  4 2 4 2 1 1 2 1 5 4 ...
##  $ v150  : int  2 1 2 3 1 4 2 1 5 2 ...
##  $ v151  : int  4 1 5 4 4 1 5 1 4 3 ...
##  $ v152  : int  2 1 1 3 1 1 2 1 4 4 ...
##  $ v153  : int  5 5 5 5 5 5 5 5 5 2 ...
##  $ v154  : int  1 1 4 2 1 3 5 1 4 2 ...
##  $ v155  : int  5 4 4 3 5 5 4 5 4 4 ...
##  $ v156  : int  1 1 2 2 1 1 5 1 5 2 ...
##  $ v157  : int  4 1 4 3 1 1 2 1 3 4 ...
##  $ v158  : int  4 4 5 2 5 5 2 5 5 4 ...
##  $ v159  : int  1 4 4 1 2 1 4 1 4 2 ...
##  $ v160  : int  2 5 5 4 4 4 5 5 5 4 ...
##  $ v161  : int  2 4 3 3 1 1 4 1 2 4 ...
##  $ v162  : int  5 4 5 4 4 4 5 5 5 4 ...
##  $ v163  : int  2 1 2 3 1 1 2 1 4 1 ...
##  $ v164  : int  4 1 5 2 4 1 5 1 5 4 ...
##  $ v165  : int  2 1 3 2 1 1 1 1 2 2 ...
##  $ v166  : int  2 4 5 2 1 1 5 2 5 4 ...
##  $ v167  : int  2 4 5 3 4 4 2 4 5 2 ...
##  $ v168  : int  5 5 5 4 5 5 5 5 4 5 ...
##  $ v169  : int  1 1 1 2 1 1 5 1 4 4 ...
##  $ v170  : int  5 1 4 3 2 4 4 1 2 4 ...
##  $ v171  : int  5 5 5 4 1 2 5 5 5 5 ...
##  $ v172  : int  2 1 5 5 2 2 5 1 5 3 ...
##  $ v173  : int  2 2 4 2 2 1 4 1 1 4 ...
##  $ v174  : int  2 4 2 3 4 1 5 5 4 2 ...
##  $ v175  : int  1 1 4 2 2 1 2 1 5 4 ...
##  $ v176  : int  4 4 4 3 1 4 4 3 3 2 ...
##  $ v177  : int  4 4 5 3 4 2 4 4 4 4 ...
##  $ v178  : int  4 1 4 2 1 1 2 1 4 4 ...
##  $ v179  : int  4 4 4 3 4 2 4 4 5 4 ...
##  $ v180  : int  4 2 5 3 3 1 1 1 1 2 ...
# View sisters67
glimpse(sisters67)
## Observations: 19,278
## Variables: 67
## $ age    <int> 40, 30, 40, 30, 40, 30, 70, 30, 60, 80, 90, 40, 60, 80,...
## $ sister <int> 11545, 16953, 73323, 75339, 36303, 95318, 22474, 114526...
## $ v116   <int> 5, 4, 2, 4, 4, 2, 4, 4, 4, 5, 2, 5, 4, 4, 3, 4, 5, 3, 4...
## $ v117   <int> 2, 1, 2, 3, 2, 4, 5, 1, 5, 1, 3, 2, 5, 4, 1, 1, 1, 1, 2...
## $ v118   <int> 2, 4, 5, 3, 3, 5, 5, 4, 5, 2, 4, 4, 4, 5, 2, 4, 4, 4, 2...
## $ v119   <int> 2, 4, 5, 4, 5, 5, 5, 5, 4, 1, 4, 5, 3, 4, 5, 5, 5, 5, 4...
## $ v120   <int> 4, 1, 3, 3, 1, 1, 5, 1, 5, 2, 3, 1, 5, 4, 4, 1, 1, 1, 2...
## $ v121   <int> 4, 1, 4, 4, 4, 5, 4, 1, 5, 5, 4, 1, 3, 4, 3, 2, 5, 3, 3...
## $ v122   <int> 4, 1, 2, 2, 4, 1, 1, 2, 2, 1, 4, 5, 1, 2, 4, 2, 1, 4, 2...
## $ v123   <int> 5, 5, 3, 4, 4, 3, 1, 5, 2, 5, 3, 4, 3, 4, 5, 5, 4, 5, 4...
## $ v124   <int> 1, 1, 5, 2, 3, 1, 5, 3, 5, 4, 4, 1, 3, 2, 1, 1, 3, 2, 2...
## $ v125   <int> 4, 2, 5, 3, 4, 2, 5, 2, 5, 5, 5, 5, 5, 5, 1, 1, 5, 1, 2...
## $ v126   <int> 2, 1, 1, 3, 1, 1, 5, 1, 5, 2, 4, 1, 5, 1, 3, 1, 5, 1, 2...
## $ v127   <int> 1, 4, 5, 2, 2, 1, 1, 1, 4, 1, 4, 1, 3, 5, 2, 1, 1, 2, 2...
## $ v128   <int> 2, 1, 4, 3, 4, 4, 5, 2, 5, 3, 2, 5, 5, 4, 1, 1, 4, 1, 1...
## $ v129   <int> 4, 4, 5, 4, 5, 4, 5, 5, 4, 1, 5, 1, 5, 5, 5, 1, 5, 5, 5...
## $ v130   <int> 2, 4, 4, 3, 3, 1, 5, 1, 5, 4, 5, 5, 1, 4, 1, 1, 4, 3, 2...
## $ v131   <int> 1, 2, 2, 3, 5, 5, 2, 3, 3, 2, 3, 4, 3, 4, 2, 4, 3, 4, 4...
## $ v132   <int> 5, 5, 5, 4, 5, 2, 2, 5, 4, 5, 4, 5, 5, 5, 4, 5, 3, 5, 5...
## $ v133   <int> 2, 4, 5, 3, 5, 1, 4, 2, 4, 4, 5, 1, 1, 1, 2, 4, 3, 1, 2...
## $ v134   <int> 2, 4, 4, 3, 4, 4, 1, 4, 4, 2, 3, 5, 2, 4, 4, 4, 3, 3, 4...
## $ v135   <int> 5, 5, 4, 3, 5, 4, 1, 5, 5, 2, 4, 5, 3, 5, 2, 5, 3, 5, 5...
## $ v136   <int> 1, 4, 4, 2, 4, 4, 1, 4, 4, 2, 4, 4, 4, 4, 2, 2, 4, 2, 2...
## $ v137   <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 4, 5, 1, 3, 1, 1, 1, 1, 1, 1...
## $ v138   <int> 2, 1, 3, 1, 3, 1, 4, 1, 2, 1, 3, 2, 1, 3, 2, 1, 4, 3, 1...
## $ v139   <int> 3, 1, 3, 3, 1, 1, 4, 1, 5, 4, 4, 1, 2, 4, 1, 1, 2, 1, 1...
## $ v140   <int> 1, 2, 1, 2, 4, 4, 5, 2, 5, 2, 2, 1, 5, 2, 1, 4, 1, 2, 2...
## $ v141   <int> 5, 5, 4, 3, 3, 3, 4, 5, 4, 4, 5, 5, 5, 5, 5, 4, 4, 3, 5...
## $ v142   <int> 1, 1, 2, 2, 2, 1, 2, 1, 4, 3, 4, 2, 2, 3, 2, 2, 1, 3, 1...
## $ v143   <int> 2, 1, 5, 4, 4, 5, 4, 5, 4, 1, 4, 5, 5, 2, 5, 5, 3, 3, 5...
## $ v144   <int> 1, 2, 1, 2, 1, 1, 3, 1, 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1...
## $ v145   <int> 4, 4, 5, 3, 4, 1, 5, 2, 5, 4, 4, 1, 4, 5, 2, 2, 1, 2, 2...
## $ v146   <int> 4, 4, 5, 4, 5, 5, 4, 5, 2, 4, 4, 4, 4, 4, 2, 5, 3, 5, 4...
## $ v147   <int> 2, 2, 1, 2, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 2, 2, 3, 2, 4...
## $ v148   <int> 1, 1, 4, 1, 1, 4, 4, 1, 5, 1, 4, 1, 3, 1, 1, 1, 2, 1, 1...
## $ v149   <int> 4, 2, 4, 2, 1, 1, 2, 1, 5, 4, 4, 2, 5, 1, 1, 2, 5, 2, 1...
## $ v150   <int> 2, 1, 2, 3, 1, 4, 2, 1, 5, 2, 5, 2, 2, 2, 3, 1, 5, 1, 1...
## $ v151   <int> 4, 1, 5, 4, 4, 1, 5, 1, 4, 3, 4, 1, 2, 5, 2, 4, 5, 1, 4...
## $ v152   <int> 2, 1, 1, 3, 1, 1, 2, 1, 4, 4, 4, 1, 4, 3, 4, 1, 1, 1, 2...
## $ v153   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 3, 2, 5, 5, 4, 5, 5, 5, 4...
## $ v154   <int> 1, 1, 4, 2, 1, 3, 5, 1, 4, 2, 5, 1, 5, 5, 1, 1, 4, 1, 1...
## $ v155   <int> 5, 4, 4, 3, 5, 5, 4, 5, 4, 4, 3, 4, 3, 5, 2, 5, 5, 5, 1...
## $ v156   <int> 1, 1, 2, 2, 1, 1, 5, 1, 5, 2, 5, 1, 1, 4, 1, 1, 3, 1, 1...
## $ v157   <int> 4, 1, 4, 3, 1, 1, 2, 1, 3, 4, 2, 1, 2, 3, 3, 2, 3, 1, 1...
## $ v158   <int> 4, 4, 5, 2, 5, 5, 2, 5, 5, 4, 4, 5, 4, 2, 5, 4, 4, 3, 4...
## $ v159   <int> 1, 4, 4, 1, 2, 1, 4, 1, 4, 2, 4, 1, 3, 2, 1, 1, 2, 1, 1...
## $ v160   <int> 2, 5, 5, 4, 4, 4, 5, 5, 5, 4, 5, 2, 5, 5, 5, 4, 5, 2, 4...
## $ v161   <int> 2, 4, 3, 3, 1, 1, 4, 1, 2, 4, 5, 1, 4, 5, 1, 1, 3, 1, 1...
## $ v162   <int> 5, 4, 5, 4, 4, 4, 5, 5, 5, 4, 4, 5, 5, 5, 3, 4, 5, 5, 5...
## $ v163   <int> 2, 1, 2, 3, 1, 1, 2, 1, 4, 1, 4, 1, 1, 1, 1, 2, 3, 3, 1...
## $ v164   <int> 4, 1, 5, 2, 4, 1, 5, 1, 5, 4, 4, 1, 1, 5, 1, 4, 3, 1, 4...
## $ v165   <int> 2, 1, 3, 2, 1, 1, 1, 1, 2, 2, 5, 2, 1, 5, 2, 3, 3, 2, 4...
## $ v166   <int> 2, 4, 5, 2, 1, 1, 5, 2, 5, 4, 5, 1, 2, 4, 2, 4, 5, 3, 4...
## $ v167   <int> 2, 4, 5, 3, 4, 4, 2, 4, 5, 2, 4, 4, 2, 5, 2, 4, 3, 2, 4...
## $ v168   <int> 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 4, 5, 5, 3, 4, 3, 4, 5...
## $ v169   <int> 1, 1, 1, 2, 1, 1, 5, 1, 4, 4, 5, 1, 1, 1, 1, 1, 1, 1, 1...
## $ v170   <int> 5, 1, 4, 3, 2, 4, 4, 1, 2, 4, 3, 3, 3, 5, 4, 3, 5, 3, 4...
## $ v171   <int> 5, 5, 5, 4, 1, 2, 5, 5, 5, 5, 5, 1, 5, 5, 3, 4, 5, 4, 5...
## $ v172   <int> 2, 1, 5, 5, 2, 2, 5, 1, 5, 3, 5, 1, 5, 5, 2, 2, 3, 5, 2...
## $ v173   <int> 2, 2, 4, 2, 2, 1, 4, 1, 1, 4, 4, 1, 2, 5, 4, 4, 3, 1, 4...
## $ v174   <int> 2, 4, 2, 3, 4, 1, 5, 5, 4, 2, 4, 5, 3, 4, 2, 4, 3, 3, 4...
## $ v175   <int> 1, 1, 4, 2, 2, 1, 2, 1, 5, 4, 3, 1, 2, 4, 1, 4, 3, 1, 1...
## $ v176   <int> 4, 4, 4, 3, 1, 4, 4, 3, 3, 2, 5, 5, 3, 5, 3, 1, 3, 3, 2...
## $ v177   <int> 4, 4, 5, 3, 4, 2, 4, 4, 4, 4, 5, 2, 5, 5, 3, 2, 5, 4, 4...
## $ v178   <int> 4, 1, 4, 2, 1, 1, 2, 1, 4, 4, 4, 1, 2, 4, 1, 2, 3, 1, 2...
## $ v179   <int> 4, 4, 4, 3, 4, 2, 4, 4, 5, 4, 5, 2, 5, 5, 3, 1, 5, 3, 4...
## $ v180   <int> 4, 2, 5, 3, 3, 1, 1, 1, 1, 2, 4, 2, 2, 5, 1, 1, 3, 3, 2...
# Plot the histogram
ggplot(sisters67, aes(x = age)) +
    geom_histogram(binwidth = 10)

# Tidy the data set
tidy_sisters <- sisters67 %>%
    select(-sister) %>%
    gather(key, value, -age)

# Print the structure of tidy_sisters
glimpse(tidy_sisters)
## Observations: 1,253,070
## Variables: 3
## $ age   <int> 40, 30, 40, 30, 40, 30, 70, 30, 60, 80, 90, 40, 60, 80, ...
## $ key   <chr> "v116", "v116", "v116", "v116", "v116", "v116", "v116", ...
## $ value <int> 5, 4, 2, 4, 4, 2, 4, 4, 4, 5, 2, 5, 4, 4, 3, 4, 5, 3, 4,...
# Overall agreement with all questions varied by age
tidy_sisters %>%
    group_by(age) %>%
    summarize(value = mean(value, na.rm = TRUE))
## # A tibble: 9 x 2
##     age value
##   <int> <dbl>
## 1    20  2.82
## 2    30  2.81
## 3    40  2.82
## 4    50  2.95
## 5    60  3.10
## 6    70  3.25
## 7    80  3.39
## 8    90  3.55
## 9   100  3.93
# Number of respondents agreed or disagreed overall
tidy_sisters %>%
    count(value)
## # A tibble: 5 x 2
##   value      n
##   <int>  <int>
## 1     1 326386
## 2     2 211534
## 3     3 160961
## 4     4 277062
## 5     5 277127
# Visualize agreement with age
tidy_sisters %>%
    filter(key %in% paste0("v", 153:170)) %>%
    group_by(key, value) %>%
    summarize(age = mean(age, na.rm = TRUE)) %>%
    ggplot(aes(value, age, color = key)) +
    geom_line(show.legend = FALSE) +
    facet_wrap(~key, nrow = 3)

# Remove the sister column
sisters_select <- sisters67 %>% 
    select(-sister)

# Build a simple linear regression model
simple_lm <- lm(age ~ ., 
                data = sisters_select)

# Print the summary of the model
summary(simple_lm)
## 
## Call:
## lm(formula = age ~ ., data = sisters_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.663  -9.586  -1.207   8.991  53.286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.59542    1.07173  25.748  < 2e-16 ***
## v116        -0.69014    0.07727  -8.931  < 2e-16 ***
## v117        -0.15914    0.08869  -1.794 0.072786 .  
## v118        -0.74668    0.08473  -8.813  < 2e-16 ***
## v119        -0.35314    0.08321  -4.244 2.21e-05 ***
## v120        -0.13875    0.07513  -1.847 0.064813 .  
## v121         0.04265    0.07794   0.547 0.584247    
## v122         0.05237    0.08086   0.648 0.517208    
## v123        -0.96372    0.09061 -10.636  < 2e-16 ***
## v124         0.44543    0.08681   5.131 2.91e-07 ***
## v125         0.50420    0.07425   6.791 1.15e-11 ***
## v126         0.44358    0.08579   5.170 2.36e-07 ***
## v127        -0.04781    0.07915  -0.604 0.545810    
## v128         0.04459    0.07595   0.587 0.557162    
## v129         0.03044    0.07881   0.386 0.699351    
## v130         0.51028    0.08064   6.328 2.54e-10 ***
## v131        -0.54431    0.08417  -6.467 1.02e-10 ***
## v132        -0.02527    0.09337  -0.271 0.786703    
## v133        -0.67041    0.07563  -8.864  < 2e-16 ***
## v134        -0.12144    0.09060  -1.340 0.180130    
## v135         0.45773    0.10886   4.205 2.63e-05 ***
## v136        -0.08790    0.07438  -1.182 0.237293    
## v137         0.74412    0.10230   7.274 3.63e-13 ***
## v138         0.31534    0.10601   2.974 0.002939 ** 
## v139         1.36585    0.10514  12.990  < 2e-16 ***
## v140        -0.73675    0.07371  -9.995  < 2e-16 ***
## v141         0.50515    0.09355   5.400 6.75e-08 ***
## v142        -0.22168    0.08357  -2.653 0.007992 ** 
## v143         0.08320    0.08375   0.993 0.320536    
## v144         1.09413    0.10870  10.066  < 2e-16 ***
## v145        -0.46821    0.08217  -5.698 1.23e-08 ***
## v146        -0.50063    0.08094  -6.185 6.32e-10 ***
## v147        -0.28499    0.09800  -2.908 0.003640 ** 
## v148         1.47288    0.09165  16.070  < 2e-16 ***
## v149        -0.29683    0.08562  -3.467 0.000528 ***
## v150        -0.33882    0.08396  -4.036 5.46e-05 ***
## v151         0.79497    0.08901   8.931  < 2e-16 ***
## v152        -0.02073    0.08179  -0.253 0.799906    
## v153        -0.53982    0.09110  -5.925 3.17e-09 ***
## v154         0.98930    0.07843  12.614  < 2e-16 ***
## v155         0.96066    0.09897   9.706  < 2e-16 ***
## v156         1.07836    0.09176  11.752  < 2e-16 ***
## v157         0.07577    0.08249   0.918 0.358378    
## v158         0.05330    0.08419   0.633 0.526696    
## v159        -0.28846    0.08321  -3.467 0.000528 ***
## v160         0.28066    0.08559   3.279 0.001043 ** 
## v161         0.67235    0.08759   7.677 1.71e-14 ***
## v162        -0.29388    0.10063  -2.920 0.003501 ** 
## v163        -1.38883    0.09242 -15.027  < 2e-16 ***
## v164        -0.44411    0.07017  -6.329 2.52e-10 ***
## v165        -0.49356    0.09033  -5.464 4.71e-08 ***
## v166         0.24787    0.08329   2.976 0.002924 ** 
## v167        -0.06290    0.08185  -0.768 0.442236    
## v168         0.33712    0.09425   3.577 0.000349 ***
## v169         1.44938    0.08634  16.786  < 2e-16 ***
## v170         1.01626    0.09083  11.189  < 2e-16 ***
## v171         0.90086    0.08359  10.777  < 2e-16 ***
## v172         0.07702    0.07176   1.073 0.283135    
## v173         0.76461    0.06936  11.025  < 2e-16 ***
## v174         0.22074    0.07851   2.812 0.004934 ** 
## v175         0.18369    0.07930   2.316 0.020553 *  
## v176         1.03334    0.08996  11.487  < 2e-16 ***
## v177        -0.07908    0.09643  -0.820 0.412153    
## v178        -0.08005    0.08250  -0.970 0.331906    
## v179         0.29778    0.09251   3.219 0.001289 ** 
## v180         0.11524    0.08566   1.345 0.178538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 19212 degrees of freedom
## Multiple R-squared:  0.3332, Adjusted R-squared:  0.3309 
## F-statistic: 147.7 on 65 and 19212 DF,  p-value: < 2.2e-16
# Split the data into training and validation/test sets
set.seed(1234)
in_train <- caret::createDataPartition(sisters_select$age, p = 0.6, list = FALSE)
training <- sisters_select[in_train, ]
validation_test <- sisters_select[-in_train, ]

# Split the validation and test sets
set.seed(1234)
in_test <- caret::createDataPartition(validation_test$age, p = 0.5, list = FALSE)
testing <- validation_test[in_test, ]
validation <- validation_test[-in_test, ]


# Fit a CART model
sisters_cart <- caret::train(age ~ ., method = "rpart", data = training)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
# Print the CART model
sisters_cart
## CART 
## 
## 11569 samples
##    65 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11569, 11569, 11569, 11569, 11569, 11569, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.02304336  14.61359  0.1724244  12.00686
##   0.04935303  14.89119  0.1403800  12.41303
##   0.11481230  15.54485  0.1046127  13.19914
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02304336.
inSmall <- sample(1:nrow(training), 500, replace=FALSE)
smallSisters <- training[sort(inSmall), ]

sisters_xgb <- caret::train(age ~ ., method = "xgbTree", data = smallSisters)
sisters_gbm <- caret::train(age ~ ., method = "gbm", data = smallSisters, verbose=FALSE)

# Make predictions on the three models
modeling_results <- validation %>%
    mutate(CART = predict(sisters_cart, validation),
           XGB = predict(sisters_xgb, validation),
           GBM = predict(sisters_gbm, validation))

# View the predictions
modeling_results %>% 
    select(CART, XGB, GBM)
## # A tibble: 3,854 x 3
##     CART   XGB   GBM
##    <dbl> <dbl> <dbl>
##  1  49.5  46.2  44.3
##  2  49.5  61.1  56.5
##  3  58.0  59.9  65.6
##  4  58.0  60.0  61.9
##  5  58.0  71.6  74.6
##  6  49.5  50.9  53.4
##  7  49.5  58.6  55.0
##  8  49.5  42.2  38.0
##  9  41.3  41.7  38.2
## 10  58.0  51.6  50.0
## # ... with 3,844 more rows
# Compare performance
yardstick::metrics(modeling_results, truth = age, estimate = CART)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  14.6 0.163
yardstick::metrics(modeling_results, truth = age, estimate = XGB)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  13.5 0.287
yardstick::metrics(modeling_results, truth = age, estimate = GBM)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  13.6 0.286
# Calculate RMSE
testing %>%
    mutate(prediction = predict(sisters_gbm, testing)) %>%
    yardstick::rmse(truth = age, estimate = prediction)
## [1] 13.87981

Business Process Analytics in R

Chapter 1 - Introduction to Process Analysis

Introduction and overview:

  • Efficient processes are core to many businesses, and improved data makes further analysis possible
  • The “internet of things” has created significant amounts of event data - why, what, and who
    • Why is the purpose of the process
    • What are the steps in the process
    • Who is the person responsible for the activity (can be machines or information systems as well; referred to as “resources”)
  • Process workflow is iterative across Extraction-Processing-Analysis

Activities as cornerstones of processes:

  • Data from an online learning platform; activities are captured and can be used for further analysis
  • Activities describe the flow of the process, and are one of the most important components of the process
    • bupaR::activity_labels() is like names() for activities data
    • bupaR::activities() is like summary() for activities data
  • Each case is described by the sequence of activities, known as its “trace”
    • bupaR::traces() will create a frequency table of the traces
    • processmapR::trace_explorer() will visualize the cases

Components of process data:

  • Cases are the objects flowing through the process, while activities are the actions performed on them
    • An activity instance is the occurrence of an activity (which can be a series of events) - specific action, case, time, etc.
    • The “lifecycle status” records the state of an activity instance, such as Scheduled, Started, or Completed
    • The “event log” is the journal of the events
    • The “resources” are the actors in the process
  • Can create an event log using the eventlog() function
    • event_data %>% eventlog(case_id = "patient", activity_id = "handling", activity_instance_id = "handling_id", timestamp = "time", lifecycle_id = "registration_type", resource_id = "employee")

Example code includes:

# Load the processmapR package using library
library(processmapR)
## 
## Attaching package: 'processmapR'
## The following object is masked from 'package:stats':
## 
##     frequency
library(bupaR)
## Loading required package: edeaR
## Loading required package: eventdataR
## Loading required package: xesreadR
## Loading required package: processmonitR
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'processmonitR'
## Loading required package: petrinetR
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'petrinetR'
## 
## Attaching package: 'bupaR'
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:utils':
## 
##     timestamp
# activity labels for the 68 'start' events followed by the matching 68 'complete' events
handling <- rep(c(rep('Registration', 12), rep('Triage and Assessment', 12), rep('Blood test', 8), rep('MRI SCAN', 8), rep('X-Ray', 4), rep('Discuss Results', 12), rep('Check-out', 12)), 2)
patient <- c('43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '156', '170', '172', '184', '278', '348', '420', '43', '156', '170', '172', '184', '278', '348', '420', '155', '221', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '156', '170', '172', '184', '278', '348', '420', '43', '156', '170', '172', '184', '278', '348', '420', '155', '221', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493')
employee <- c('r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r5', 'r5', 'r5', 'r5', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r5', 'r5', 'r5', 'r5', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7')
handling_id <- c('43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '543', '655', '656', '670', '672', '684', '721', '778', '848', '920', '955', '993', '1020', '1072', '1081', '1082', '1088', '1127', '1163', '1199', '1257', '1309', '1318', '1319', '1325', '1364', '1400', '1436', '1557', '1587', '1710', '1730', '1777', '1889', '1890', '1904', '1906', '1918', '1955', '2012', '2082', '2154', '2189', '2227', '2272', '2384', '2385', '2399', '2401', '2413', '2450', '2507', '2577', '2649', '2684', '2720', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '543', '655', '656', '670', '672', '684', '721', '778', '848', '920', '955', '993', '1020', '1072', '1081', '1082', '1088', '1127', '1163', '1199', '1257', '1309', '1318', '1319', '1325', '1364', '1400', '1436', '1557', '1587', '1710', '1730', '1777', '1889', '1890', '1904', '1906', '1918', '1955', '2012', '2082', '2154', '2189', '2227', '2272', '2384', '2385', '2399', '2401', '2413', '2450', '2507', '2577', '2649', '2684', '2720')
registration_type <- c('start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete')
rTime <- c('2017-02-19 04:38:51', '2017-06-03 10:05:28', '2017-06-03 10:05:28', '2017-06-17 15:10:30', '2017-06-17 23:00:33', '2017-06-27 07:48:22', '2017-08-03 17:05:27', '2017-09-26 20:22:49', '2017-11-24 08:28:44', '2018-02-08 03:39:21', '2018-03-14 21:04:28', '2018-04-29 04:55:10', '2017-02-19 07:28:53', '2017-06-04 06:27:00', '2017-06-03 13:23:14', '2017-06-17 16:31:58', '2017-06-18 18:29:13', '2017-06-28 00:14:50', '2017-08-04 07:22:06', '2017-09-27 22:57:03', '2017-11-24 10:33:00', '2018-02-08 17:33:12', '2018-03-15 15:12:41', '2018-04-30 19:40:22', '2017-02-20 19:59:18', '2017-06-04 15:18:50', '2017-06-18 22:51:07', '2017-06-21 02:43:27', '2017-07-01 23:55:10', '2017-09-28 22:58:23', '2017-11-25 12:06:18', '2018-02-12 09:01:38', '2017-02-21 06:49:49', '2017-06-04 23:23:28', '2017-06-19 06:44:30', '2017-06-21 11:16:30', '2017-07-02 11:16:08', '2017-09-29 07:28:10', '2017-11-25 21:54:56', '2018-02-12 19:43:42', '2017-06-05 00:12:24', '2017-08-05 08:25:17', '2018-03-17 10:30:24', '2018-05-02 07:32:45', '2017-02-21 14:50:43', '2017-06-05 14:03:19', '2017-06-05 10:26:16', '2017-06-19 22:46:10', '2017-06-22 04:39:35', '2017-07-03 01:28:49', '2017-08-05 22:06:23', '2017-09-29 19:13:51', '2017-11-26 06:52:23', '2018-02-17 02:44:58', '2018-03-18 00:20:51', '2018-05-02 18:14:11', '2017-02-24 14:58:43', '2017-06-05 15:58:53', '2017-06-05 15:58:53', '2017-06-20 03:48:37', '2017-06-22 08:40:55', '2017-07-03 03:39:51', '2017-08-08 23:17:45', '2017-09-29 21:16:01', '2017-11-27 04:56:53', '2018-02-20 09:49:29', '2018-03-18 08:12:07', '2018-05-03 00:11:10', '2017-02-19 07:28:53', '2017-06-03 14:19:00', '2017-06-03 13:23:14', '2017-06-17 16:31:58', '2017-06-18 01:07:42', '2017-06-27 12:22:51', '2017-08-03 19:25:12', '2017-09-26 22:17:18', '2017-11-24 10:33:00', '2018-02-08 06:01:38', '2018-03-15 00:34:01', '2018-04-29 07:39:14', '2017-02-19 21:58:08', '2017-06-04 14:23:26', '2017-06-04 06:27:00', '2017-06-18 04:14:55', '2017-06-19 00:40:19', '2017-06-28 12:48:20', 
'2017-08-04 21:09:17', '2017-09-28 12:00:12', '2017-11-25 00:44:30', '2018-02-09 07:05:52', '2018-03-16 04:08:03', '2018-05-01 10:37:51', '2017-02-21 03:12:26', '2017-06-04 19:35:51', '2017-06-19 03:01:11', '2017-06-21 08:02:20', '2017-07-02 07:43:48', '2017-09-29 04:58:49', '2017-11-25 18:30:43', '2018-02-12 13:57:13', '2017-02-21 09:57:05', '2017-06-05 02:46:59', '2017-06-19 11:40:53', '2017-06-21 16:09:26', '2017-07-02 16:03:16', '2017-09-29 12:44:39', '2017-11-26 02:40:30', '2018-02-12 23:53:46', '2017-06-05 04:39:38', '2017-08-05 13:56:39', '2018-03-17 14:09:40', '2018-05-02 12:24:41', '2017-02-21 17:57:58', '2017-06-05 15:58:53', '2017-06-05 14:03:19', '2017-06-20 01:44:29', '2017-06-22 08:40:55', '2017-07-03 03:39:51', '2017-08-05 23:53:27', '2017-09-29 21:16:01', '2017-11-26 09:44:37', '2018-02-17 06:17:57', '2018-03-18 03:22:17', '2018-05-02 21:17:12', '2017-02-24 16:03:49', '2017-06-05 17:22:16', '2017-06-05 17:15:30', '2017-06-20 05:36:40', '2017-06-22 10:59:58', '2017-07-03 05:00:48', '2017-08-09 00:13:39', '2017-09-29 23:42:48', '2017-11-27 06:53:23', '2018-02-20 12:04:00', '2018-03-18 10:48:34', '2018-05-03 02:11:42')
rOrder <- c(43, 155, 156, 170, 172, 184, 221, 278, 348, 420, 455, 493, 543, 655, 656, 670, 672, 684, 721, 778, 848, 920, 955, 993, 1020, 1072, 1081, 1082, 1088, 1127, 1163, 1199, 1257, 1309, 1318, 1319, 1325, 1364, 1400, 1436, 1557, 1587, 1710, 1730, 1777, 1889, 1890, 1904, 1906, 1918, 1955, 2012, 2082, 2154, 2189, 2227, 2272, 2384, 2385, 2399, 2401, 2413, 2450, 2507, 2577, 2649, 2684, 2720, 2764, 2876, 2877, 2891, 2893, 2905, 2942, 2999, 3069, 3141, 3176, 3214, 3264, 3376, 3377, 3391, 3393, 3405, 3442, 3499, 3569, 3641, 3676, 3714, 3741, 3793, 3802, 3803, 3809, 3848, 3884, 3920, 3978, 4030, 4039, 4040, 4046, 4085, 4121, 4157, 4278, 4308, 4431, 4451, 4498, 4610, 4611, 4625, 4627, 4639, 4676, 4733, 4803, 4875, 4910, 4948, 4993, 5105, 5106, 5120, 5122, 5134, 5171, 5228, 5298, 5370, 5405, 5441)

pFrame <- tibble(handling=factor(handling, levels=c('Blood test', 'Check-out', 'Discuss Results', 'MRI SCAN', 'Registration', 'Triage and Assessment', 'X-Ray')), 
                 patient=patient, 
                 employee=factor(employee, levels=c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7')), 
                 handling_id=handling_id, 
                 registration_type=factor(registration_type, levels=c("complete", "start")), 
                 time=as.POSIXct(rTime), 
                 .order=rOrder
                 )

patients <- eventlog(pFrame,
    case_id = "patient",
    activity_id = "handling",
    activity_instance_id = "handling_id",
    lifecycle_id = "registration_type",
    timestamp = "time",
    resource_id = "employee")


# The function slice can be used to take a slice of cases out of the eventdata. slice(1:10) will select the first ten cases in the event log, where first is defined by the current ordering of the data.

# How many patients are there?
n_cases(patients)
## [1] 12
# Print the summary of the data
summary(patients)
## Number of events:  136
## Number of cases:  12
## Number of traces:  2
## Number of distinct activities:  7
## Average trace length:  11.33333
## 
## Start eventlog:  2017-02-19 04:38:51
## End eventlog:  2018-05-03 02:11:42
##                   handling    patient          employee handling_id       
##  Blood test           :16   Length:136         r1:24    Length:136        
##  Check-out            :24   Class :character   r2:24    Class :character  
##  Discuss Results      :24   Mode  :character   r3:16    Mode  :character  
##  MRI SCAN             :16                      r4:16                      
##  Registration         :24                      r5: 8                      
##  Triage and Assessment:24                      r6:24                      
##  X-Ray                : 8                      r7:24                      
##  registration_type      time                         .order      
##  complete:68       Min.   :2017-02-19 04:38:51   Min.   :  1.00  
##  start   :68       1st Qu.:2017-06-14 15:43:26   1st Qu.: 34.75  
##                    Median :2017-07-03 03:39:51   Median : 68.50  
##                    Mean   :2017-09-06 11:31:32   Mean   : 68.50  
##                    3rd Qu.:2017-11-26 14:32:41   3rd Qu.:102.25  
##                    Max.   :2018-05-03 02:11:42   Max.   :136.00  
## 
# Show the journey of the first patient
slice(patients, 1)
## Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-02-19 04:38:51 until 2017-02-24 16:03:49 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 43      r1       43          start           
##  2 Triage ~ 43      r2       543         start           
##  3 Blood t~ 43      r3       1020        start           
##  4 MRI SCAN 43      r4       1257        start           
##  5 Discuss~ 43      r6       1777        start           
##  6 Check-o~ 43      r7       2272        start           
##  7 Registr~ 43      r1       43          complete        
##  8 Triage ~ 43      r2       543         complete        
##  9 Blood t~ 43      r3       1020        complete        
## 10 MRI SCAN 43      r4       1257        complete        
## 11 Discuss~ 43      r6       1777        complete        
## 12 Check-o~ 43      r7       2272        complete        
## # ... with 2 more variables: time <dttm>, .order <int>
# How many distinct activities are there?
n_activities(patients)
## [1] 7
# What are the names of the activities?
activity_labels(patients)
## [1] Registration          Triage and Assessment Blood test           
## [4] MRI SCAN              X-Ray                 Discuss Results      
## [7] Check-out            
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray
# Create a list of activities
activities(patients)
## # A tibble: 7 x 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Check-out                             12             0.176 
## 2 Discuss Results                       12             0.176 
## 3 Registration                          12             0.176 
## 4 Triage and Assessment                 12             0.176 
## 5 Blood test                             8             0.118 
## 6 MRI SCAN                               8             0.118 
## 7 X-Ray                                  4             0.0588
# Have a look at the different traces
traces(patients)
## # A tibble: 2 x 3
##   trace                                  absolute_frequen~ relative_freque~
##   <chr>                                              <int>            <dbl>
## 1 Registration,Triage and Assessment,Bl~                 8            0.667
## 2 Registration,Triage and Assessment,X-~                 4            0.333
# How many are there?
n_traces(patients)
## [1] 2
# Visualize the traces using trace_explorer
trace_explorer(patients, coverage=1)

# Draw process map
process_map(patients)
## Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
## Please use the non-prefixed form or `!!` instead.
## 
##   # Bad:
##   rlang::expr(mean(rlang::UQ(var) * 100))
## 
##   # Ok:
##   rlang::expr(mean(UQ(var) * 100))
## 
##   # Good:
##   rlang::expr(mean(!!var * 100))
## 
## This warning is displayed once per session.
claims <- tibble(id=c("claim1", "claim1", "claim2", "claim2", "claim2"), 
                 action=c(10002L, 10011L, 10015L, 10024L, 10024L), 
                 action_type=c("Check Contract", "Pay Back Decision", "Check Contract", "Pay Back Decision", "Pay Back Decision"), 
                 date=as.Date(c("2008-01-12", "2008-03-22", "2008-01-13", "2008-03-23", "2008-04-14")), 
                 originator=c("Assistant 1", "Manager 2", "Assistant 6", "Manager 2", "Manager 2"), 
                 status=as.factor(c("start", "start", "start", "start", "complete"))
                 )
claims
## # A tibble: 5 x 6
##   id     action action_type       date       originator  status  
##   <chr>   <int> <chr>             <date>     <chr>       <fct>   
## 1 claim1  10002 Check Contract    2008-01-12 Assistant 1 start   
## 2 claim1  10011 Pay Back Decision 2008-03-22 Manager 2   start   
## 3 claim2  10015 Check Contract    2008-01-13 Assistant 6 start   
## 4 claim2  10024 Pay Back Decision 2008-03-23 Manager 2   start   
## 5 claim2  10024 Pay Back Decision 2008-04-14 Manager 2   complete
#create eventlog claims_log 
claims_log <- eventlog(claims,
    case_id = "id",
    activity_id = "action_type",
    activity_instance_id = "action",
    lifecycle_id = "status",
    timestamp = "date",
    resource_id = "originator")

# Print summary
summary(claims_log)
## Number of events:  5
## Number of cases:  2
## Number of traces:  1
## Number of distinct activities:  2
## Average trace length:  2.5
## 
## Start eventlog:  2008-01-12
## End eventlog:  2008-04-14
##       id               action                     action_type
##  Length:5           Length:5           Check Contract   :2   
##  Class :character   Class :character   Pay Back Decision:3   
##  Mode  :character   Mode  :character                         
##                                                              
##                                                              
##                                                              
##       date                  originator      status      .order 
##  Min.   :2008-01-12   Assistant 1:1    complete:1   Min.   :1  
##  1st Qu.:2008-01-13   Assistant 6:1    start   :4   1st Qu.:2  
##  Median :2008-03-22   Manager 2  :3                 Median :3  
##  Mean   :2008-02-28                                 Mean   :3  
##  3rd Qu.:2008-03-23                                 3rd Qu.:4  
##  Max.   :2008-04-14                                 Max.   :5
# Check activity labels
activity_labels(claims_log)
## [1] Check Contract    Pay Back Decision
## Levels: Check Contract Pay Back Decision
# Once you have an eventlog, you can access its complete metadata using the function mapping or the functions case_id, activity_id etc., to inspect individual identifiers.

Chapter 2 - Analysis Techniques

Organizational analysis:

  • Processes are always dependent on resources, even if automated (machines and algorithms can be resources)
    • Who executes the task, how specialized is the knowledge, etc.
    • resource_labels(log_hospital) # will pull out the resources
    • resources(log_hospital) # will pull out frequencies by resource
  • Can create a resource-activity matrix
    • A resource that performs only a few activities is considered specialized in those activities
    • If only one person ever performs a specific activity, then there is a high risk of “brain drain”
    • The plot() function, applied to an event_log, will create the resource-activity matrix
    • resource_map(log_hospital) # shows arrows between the work flows
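
A hedged sketch of these functions, applied to the patients event log built earlier (output not shown):

# Who appears in the log, and how often they act
resource_labels(patients)
resources(patients)

# Hand-over-of-work map between resources
patients %>% resource_map()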

Structuredness:

  • Control-flow refers to the succession of activities
    • Each unique flow is referred to as a trace
    • Metrics include entry/exit points, length of cases, presence of activities, rework, etc.
    • log_healthcare %>% start_activities(level = "activity") %>% plot()
    • log_healthcare %>% end_activities(level = "activity") %>% plot()
  • Rework is when the same activity is done multiple times for the same case
    • Repetitions are when the activity is repeated after some intervening steps
    • Self-loops are when the activity is repeated immediately after itself
  • The precedence matrix shows the relationships between the activities in a more structured manner
    • eventlog %>% precedence_matrix(type = "absolute") %>% plot() # can be type = "relative" also
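
A hedged sketch of these structuredness checks on the patients event log built earlier (output not shown); number_of_selfloops() is the companion function for self-loops:

# Entry and exit points of the process
patients %>% start_activities(level = "activity")
patients %>% end_activities(level = "activity")

# Rework per case: repetitions and self-loops
patients %>% number_of_repetitions(level = "case")
patients %>% number_of_selfloops(level = "case")

# Activity-to-activity transitions
patients %>% precedence_matrix(type = "absolute") %>% plot()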

Performance analysis:

  • Visuals can include performance process maps and dotted charts; metrics can include throughput time, processing time, idle time
    • eventlog %>% process_map(type = frequency()) # normal process map
    • eventlog %>% process_map(type = performance()) # performance process map
  • The dotted chart shows the frequency of activities over time; basically, a form of scatter plot
    • throughput_time is total time, processing_time is the sum of activity time, idle_time is the sum of time when nothing is happening
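
A hedged sketch of these performance views on the patients log (output not shown); the FUN and units arguments passed to performance() here are assumptions, not course material:

# Frequency vs. performance process maps
patients %>% process_map(type = frequency())
patients %>% process_map(type = performance(FUN = mean, units = "hours"))

# Dotted chart: one dot per event, one row per case
patients %>% dotted_chart()

# Time-based metrics per case
patients %>% throughput_time(level = "case", units = "days")
patients %>% processing_time(level = "case", units = "hours")
patients %>% idle_time(level = "case", units = "hours")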

Linking perspectives:

  • A level argument controls the granularity at which statistics are computed
    • (level = "log", "trace", "case", "activity", "resource", "resource-activity")
  • Categorical data can be leveraged using the group_by() functionality - each group will then be calculated separately
    • eventlog %>% group_by(priority) %>% number_of_repetitions(level = "resource") %>% plot()
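
Since the example logs in this document have no priority column, a runnable variant groups the patients log by employee; each metric is then computed separately within each group (output not shown):

# Activity frequencies, computed per employee group
patients %>%
    group_by(employee) %>%
    activity_frequency(level = "activity")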

Example code includes:

data(sepsis, package="eventdataR")
str(sepsis)
## Classes 'eventlog', 'tbl_df', 'tbl' and 'data.frame':    15214 obs. of  34 variables:
##  $ case_id                  : chr  "A" "A" "A" "A" ...
##  $ activity                 : Factor w/ 16 levels "Admission IC",..: 4 10 3 9 6 5 8 7 2 3 ...
##  $ lifecycle                : Factor w/ 1 level "complete": 1 1 1 1 1 1 1 1 1 1 ...
##  $ resource                 : Factor w/ 26 levels "?","A","B","C",..: 2 3 3 3 4 2 2 2 5 3 ...
##  $ timestamp                : POSIXct, format: "2014-10-22 11:15:41" "2014-10-22 11:27:00" ...
##  $ age                      : int  85 NA NA NA NA NA NA NA NA NA ...
##  $ crp                      : num  NA NA 210 NA NA NA NA NA NA 1090 ...
##  $ diagnose                 : chr  "A" NA NA NA ...
##  $ diagnosticartastrup      : chr  "true" NA NA NA ...
##  $ diagnosticblood          : chr  "true" NA NA NA ...
##  $ diagnosticecg            : chr  "true" NA NA NA ...
##  $ diagnosticic             : chr  "true" NA NA NA ...
##  $ diagnosticlacticacid     : chr  "true" NA NA NA ...
##  $ diagnosticliquor         : chr  "false" NA NA NA ...
##  $ diagnosticother          : chr  "false" NA NA NA ...
##  $ diagnosticsputum         : chr  "false" NA NA NA ...
##  $ diagnosticurinaryculture : chr  "true" NA NA NA ...
##  $ diagnosticurinarysediment: chr  "true" NA NA NA ...
##  $ diagnosticxthorax        : chr  "true" NA NA NA ...
##  $ disfuncorg               : chr  "true" NA NA NA ...
##  $ hypotensie               : chr  "true" NA NA NA ...
##  $ hypoxie                  : chr  "false" NA NA NA ...
##  $ infectionsuspected       : chr  "true" NA NA NA ...
##  $ infusion                 : chr  "true" NA NA NA ...
##  $ lacticacid               : chr  NA NA NA "2.2" ...
##  $ leucocytes               : chr  NA "9.6" NA NA ...
##  $ oligurie                 : chr  "false" NA NA NA ...
##  $ sirscritheartrate        : chr  "true" NA NA NA ...
##  $ sirscritleucos           : chr  "false" NA NA NA ...
##  $ sirscrittachypnea        : chr  "true" NA NA NA ...
##  $ sirscrittemperature      : chr  "true" NA NA NA ...
##  $ sirscriteria2ormore      : chr  "true" NA NA NA ...
##  $ activity_instance_id     : chr  "1" "2" "3" "4" ...
##  $ .order                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "case_id")= chr "case_id"
##  - attr(*, "activity_id")= chr "activity"
##  - attr(*, "activity_instance_id")= chr "activity_instance_id"
##  - attr(*, "lifecycle_id")= chr "lifecycle"
##  - attr(*, "resource_id")= chr "resource"
##  - attr(*, "timestamp")= chr "timestamp"
# Print list of resources
resource_frequency(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 B            8111  0.533  
##  2 A            3462  0.228  
##  3 C            1053  0.0692 
##  4 E             782  0.0514 
##  5 ?             294  0.0193 
##  6 F             216  0.0142 
##  7 L             213  0.0140 
##  8 O             186  0.0122 
##  9 G             148  0.00973
## 10 I             126  0.00828
## # ... with 16 more rows
# Number of resources per activity
resource_frequency(sepsis, level = "activity")
## # A tibble: 16 x 11
##    activity nr_of_resources   min    q1   mean median     q3   max st_dev
##    <fct>              <int> <int> <dbl>  <dbl>  <dbl>  <dbl> <int>  <dbl>
##  1 Admissi~               4     1    7    29.2   31     53.2    54   28.2
##  2 Admissi~              20     1   17    59.1   40.5   68.2   216   62.7
##  3 CRP                    1  3262 3262  3262   3262   3262    3262   NA  
##  4 ER Regi~               2    65  295   525    525    755     985  651. 
##  5 ER Seps~               2    65  295.  524.   524.   754.    984  650. 
##  6 ER Tria~               1  1053 1053  1053   1053   1053    1053   NA  
##  7 IV Anti~               2    45  228.  412.   412.   595.    778  518. 
##  8 IV Liqu~               2    38  207.  376.   376.   546.    715  479. 
##  9 LacticA~               1  1466 1466  1466   1466   1466    1466   NA  
## 10 Leucocy~               1  3383 3383  3383   3383   3383    3383   NA  
## 11 Release~               1   671  671   671    671    671     671   NA  
## 12 Release~               1    56   56    56     56     56      56   NA  
## 13 Release~               1    25   25    25     25     25      25   NA  
## 14 Release~               1    24   24    24     24     24      24   NA  
## 15 Release~               1     6    6     6      6      6       6   NA  
## 16 Return ~               1   294  294   294    294    294     294   NA  
## # ... with 2 more variables: iqr <dbl>, total <int>
# Plot Number of executions per resource-activity (not working in R 3.5.3)
# resource_frequency(sepsis, level = "resource-activity") %>% plot


# Calculate resource involvement
resource_involvement(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 C            1050   1     
##  2 B            1013   0.965 
##  3 A             985   0.938 
##  4 E             782   0.745 
##  5 ?             294   0.28  
##  6 F             200   0.190 
##  7 O             179   0.170 
##  8 G             147   0.14  
##  9 I             118   0.112 
## 10 M              82   0.0781
## # ... with 16 more rows
# Show graphically 
sepsis %>% resource_involvement(level = "resource") %>% plot

# Compare with resource frequency
resource_frequency(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 B            8111  0.533  
##  2 A            3462  0.228  
##  3 C            1053  0.0692 
##  4 E             782  0.0514 
##  5 ?             294  0.0193 
##  6 F             216  0.0142 
##  7 L             213  0.0140 
##  8 O             186  0.0122 
##  9 G             148  0.00973
## 10 I             126  0.00828
## # ... with 16 more rows
# Min, max and average number of repetitions
sepsis %>% number_of_repetitions(level = "log")
## Using default type: all
##      min       q1   median     mean       q3      max   st_dev      iqr 
## 0.000000 0.000000 2.000000 1.640000 3.000000 5.000000 1.280461 3.000000 
## attr(,"type")
## [1] "all"
# Plot repetitions per activity
sepsis %>% number_of_repetitions(level = "activity") %>% plot
## Using default type: all

# Number of repetitions per resources
sepsis %>% number_of_repetitions(level = "resource")
## Using default type: all
## # Description: df[,3] [26 x 3]
##    first_resource absolute relative
##    <fct>             <dbl>    <dbl>
##  1 B                  1536   0.189 
##  2 G                    67   0.453 
##  3 F                    16   0.0741
##  4 R                    13   0.228 
##  5 I                    12   0.0952
##  6 Q                    11   0.175 
##  7 O                     9   0.0484
##  8 J                     8   0.308 
##  9 T                     8   0.229 
## 10 K                     7   0.389 
## # ... with 16 more rows
eci <- c('21', '21', '21', '21', '21', '21', '21', '21', '21', '31', '31', '31', '31', '31', '31', '31', '31', '31', '31', '41', '41', '41', '41', '41', '41', '41', '51', '51', '51', '51', '51', '51', '51', '61', '61', '61', '61', '61', '61', '91', '91', '91', '91', '91', '91', '101', '101', '101', '101', '101', '101', '111', '111', '111', '111', '121', '121', '121', '121', '121', '121', '121', '121', '121', '131', '131', '131', '131', '131', '131', '131', '131', '161', '161', '171', '171', '171', '171', '181', '181', '181', '181', '181', '181', '201', '201', '201', '201', '201', '201', '201', '12', '12', '12', '12', '12', '22', '22', '22', '22', '22', '22', '32', '32', '32', '32', '32', '32', '42', '42', '42', '42', '52', '52', '52', '52', '52', '82', '82', '82', '82', '82', '92', '92', '92', '92', '92', '102', '102', '102', '102', '102', '112', '112', '122', '122', '21', '21', '21', '21', '21', '21', '21', '21', '21', '31', '31', '31', '31', '31', '31', '31', '31', '31', '31', '41', '41', '41', '41', '41', '41', '41', '51', '51', '51', '51', '51', '51', '51', '61', '61', '61', '61', '61', '61', '91', '91', '91', '91', '91', '91', '101', '101', '101', '101', '101', '101', '111', '111', '111', '111', '121', '121', '121', '121', '121', '121', '121', '121', '121', '131', '131', '131', '131', '131', '131', '131', '131', '161', '161', '171', '171', '171', '171', '181', '181', '181', '181', '181', '181', '201', '201', '201', '201', '201', '201', '201', '12', '12', '12', '12', '12', '22', '22', '22', '22', '22', '22', '32', '32', '32', '32', '32', '32', '42', '42', '42', '42', '52', '52', '52', '52', '52', '82', '82', '82', '82', '82', '92', '92', '92', '92', '92', '102', '102', '102', '102', '102', '112', '112', '122', '122')
ea1 <- c('prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'prepareBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'snack', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'eatingLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 
'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast')
ea2 <- c('eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'prepareBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'snack', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'eatingLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 
'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast')
eaii <- c('9', '10', '19', '23', '24', '26', '36', '40', '41', '51', '52', '58', '60', '62', '63', '67', '69', '72', '73', '86', '87', '89', '90', '104', '105', '107', '119', '120', '128', '132', '133', '138', '139', '149', '150', '156', '159', '160', '164', '174', '175', '192', '194', '195', '198', '205', '206', '208', '211', '213', '214', '229', '236', '237', '239', '245', '251', '252', '253', '255', '259', '260', '262', '264', '271', '276', '281', '287', '292', '293', '297', '299', '310', '312', '331', '332', '336', '347', '363', '364', '374', '376', '387', '389', '434', '435', '447', '448', '450', '453', '454', '462', '463', '471', '472', '475', '483', '484', '487', '491', '492', '496', '508', '509', '512', '517', '518', '522', '536', '540', '541', '543', '562', '563', '565', '566', '572', '584', '585', '589', '590', '598', '615', '616', '618', '619', '627', '639', '640', '642', '643', '653', '665', '666', '682', '683', '9', '10', '19', '23', '24', '26', '36', '40', '41', '51', '52', '58', '60', '62', '63', '67', '69', '72', '73', '86', '87', '89', '90', '104', '105', '107', '119', '120', '128', '132', '133', '138', '139', '149', '150', '156', '159', '160', '164', '174', '175', '192', '194', '195', '198', '205', '206', '208', '211', '213', '214', '229', '236', '237', '239', '245', '251', '252', '253', '255', '259', '260', '262', '264', '271', '276', '281', '287', '292', '293', '297', '299', '310', '312', '331', '332', '336', '347', '363', '364', '374', '376', '387', '389', '434', '435', '447', '448', '450', '453', '454', '462', '463', '471', '472', '475', '483', '484', '487', '491', '492', '496', '508', '509', '512', '517', '518', '522', '536', '540', '541', '543', '562', '563', '565', '566', '572', '584', '585', '589', '590', '598', '615', '616', '618', '619', '627', '639', '640', '642', '643', '653', '665', '666', '682', '683')
elci <- c('start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 
'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete')
ets1 <- c('2012-11-12 09:42:02', '2012-11-12 09:52:33', '2012-11-12 11:05:44', '2012-11-12 13:45:49', '2012-11-12 13:48:49', '2012-11-12 15:23:00', '2012-11-12 18:47:29', '2012-11-12 22:35:21', '2012-11-12 22:35:21', '2012-11-13 08:56:37', '2012-11-13 09:04:54', '2012-11-13 10:14:04', '2012-11-13 13:47:45', '2012-11-13 14:08:24', '2012-11-13 14:19:01', '2012-11-13 17:34:23', '2012-11-13 18:51:51', '2012-11-13 23:05:07', '2012-11-13 23:17:07', '2012-11-14 09:06:08', '2012-11-14 09:17:48', '2012-11-14 10:38:16', '2012-11-14 10:44:16', '2012-11-14 21:30:09', '2012-11-14 21:37:09', '2012-11-14 22:14:23', '2012-11-15 09:37:15', '2012-11-15 09:47:12', '2012-11-15 10:11:08', '2012-11-15 14:35:27', '2012-11-15 14:41:27', '2012-11-15 22:07:26', '2012-11-15 22:26:02', '2012-11-16 10:39:14', '2012-11-16 10:52:56', '2012-11-16 12:09:10', '2012-11-16 14:13:00', '2012-11-16 14:19:00', '2012-11-16 18:11:36', '2012-11-19 10:13:23', '2012-11-19 10:25:00', '2012-11-19 15:55:22', '2012-11-19 21:47:27', '2012-11-19 21:59:27', '2012-11-19 22:31:06', '2012-11-20 10:20:00', '2012-11-20 10:21:02', '2012-11-20 11:00:16', '2012-11-20 13:03:28', '2012-11-20 14:25:11', '2012-11-20 14:41:22', '2012-11-21 10:01:00', '2012-11-21 15:02:08', '2012-11-21 15:15:08', '2012-11-21 17:50:29', '2012-11-22 01:40:42', '2012-11-22 10:19:15', '2012-11-22 10:26:15', '2012-11-22 11:02:27', '2012-11-22 11:56:06', '2012-11-22 15:05:51', '2012-11-22 15:12:55', '2012-11-22 16:43:08', '2012-11-22 18:15:32', '2012-11-23 00:36:00', '2012-11-23 01:03:00', '2012-11-23 09:49:00', '2012-11-23 12:53:06', '2012-11-23 14:01:08', '2012-11-23 14:23:08', '2012-11-23 16:57:24', '2012-11-23 17:58:00', '2012-11-26 09:06:12', '2012-11-26 09:57:12', '2012-11-27 10:20:26', '2012-11-27 10:30:50')
ets2 <- c('2012-11-27 11:54:15', '2012-11-27 19:46:15', '2012-11-28 09:27:15', '2012-11-28 09:34:15', '2012-11-28 12:28:02', '2012-11-28 13:16:33', '2012-11-28 19:30:08', '2012-11-28 22:15:02', '2012-11-30 10:43:19', '2012-11-30 10:46:19', '2012-11-30 14:51:36', '2012-11-30 15:08:36', '2012-11-30 17:30:40', '2012-11-30 22:12:05', '2012-11-30 22:16:07', '2011-11-28 10:38:00', '2011-11-28 10:43:00', '2011-11-28 14:31:06', '2011-11-28 14:42:00', '2011-11-28 20:20:55', '2011-11-29 12:09:09', '2011-11-29 12:11:01', '2011-11-29 13:25:29', '2011-11-29 15:15:14', '2011-11-29 15:23:00', '2011-11-29 16:32:20', '2011-11-30 10:23:46', '2011-11-30 10:28:46', '2011-11-30 13:05:27', '2011-11-30 14:39:42', '2011-11-30 14:56:00', '2011-11-30 16:41:05', '2011-11-30 14:37:00', '2011-12-01 11:17:05', '2011-12-01 11:20:05', '2011-12-01 14:29:37', '2011-12-02 12:29:08', '2011-12-02 12:32:08', '2011-12-02 14:47:18', '2011-12-02 14:51:00', '2011-12-02 19:40:44', '2011-12-05 12:15:45', '2011-12-05 12:18:05', '2011-12-05 15:00:55', '2011-12-05 15:14:00', '2011-12-05 19:24:11', '2011-12-06 11:30:19', '2011-12-06 11:33:02', '2011-12-06 14:41:16', '2011-12-06 14:56:00', '2011-12-06 19:22:50', '2011-12-07 11:12:17', '2011-12-07 11:17:22', '2011-12-07 14:04:32', '2011-12-07 14:14:00', '2011-12-07 19:23:55', '2011-12-08 11:25:12', '2011-12-08 11:29:01', '2011-12-09 11:00:13', '2011-12-09 11:03:33', '2012-11-12 09:50:02', '2012-11-12 09:55:29', '2012-11-12 12:39:42', '2012-11-12 14:48:14', '2012-11-12 14:53:14', '2012-11-12 15:31:53', '2012-11-12 19:00:56', '2012-11-12 22:37:55', '2012-11-12 22:40:55', '2012-11-13 09:00:26', '2012-11-13 09:10:12', '2012-11-13 10:51:55', '2012-11-13 14:03:31', '2012-11-13 14:18:36', '2012-11-13 14:42:36', '2012-11-13 17:36:34', '2012-11-13 19:45:03', '2012-11-13 23:15:33', '2012-11-13 23:37:33', '2012-11-14 09:09:41', '2012-11-14 09:21:43', '2012-11-14 11:43:23', '2012-11-14 11:06:23', '2012-11-14 21:35:17', '2012-11-14 21:47:18', '2012-11-14 22:17:47', '2012-11-15 
09:44:06', '2012-11-15 09:48:08', '2012-11-15 10:23:49', '2012-11-15 15:40:32', '2012-11-15 15:46:32', '2012-11-15 22:22:44', '2012-11-15 22:31:00', '2012-11-16 10:42:13') 
ets3 <- c('2012-11-16 10:52:58', '2012-11-16 12:09:57', '2012-11-16 14:58:55', '2012-11-16 14:55:55', '2012-11-16 18:14:49', '2012-11-19 10:17:12', '2012-11-19 10:33:59', '2012-11-19 16:07:49', '2012-11-19 21:59:01', '2012-11-19 22:24:58', '2012-11-19 22:31:59', '2012-11-20 10:21:02', '2012-11-20 10:37:51', '2012-11-20 11:14:44', '2012-11-20 13:28:35', '2012-11-20 14:40:16', '2012-11-20 15:10:16', '2012-11-21 10:06:50', '2012-11-21 15:14:47', '2012-11-21 15:30:55', '2012-11-21 17:55:48', '2012-11-22 01:45:42', '2012-11-22 10:25:45', '2012-11-22 10:59:45', '2012-11-22 11:10:30', '2012-11-22 12:09:07', '2012-11-22 15:12:19', '2012-11-22 15:26:18', '2012-11-22 16:51:54', '2012-11-22 18:17:25', '2012-11-23 00:41:13', '2012-11-23 10:28:57', '2012-11-23 10:01:57', '2012-11-23 12:57:33', '2012-11-23 14:20:47', '2012-11-23 14:38:47', '2012-11-23 16:57:43', '2012-11-23 18:06:38', '2012-11-26 10:37:28', '2012-11-26 10:05:28', '2012-11-27 10:30:43', '2012-11-27 10:44:43', '2012-11-27 11:54:59', '2012-11-27 19:46:56', '2012-11-28 09:33:52', '2012-11-28 09:44:52', '2012-11-28 12:57:42', '2012-11-28 13:38:45', '2012-11-28 19:45:20', '2012-11-28 22:18:43', '2012-11-30 11:45:40', '2012-11-30 11:51:40', '2012-11-30 15:05:54', '2012-11-30 15:20:00', '2012-11-30 17:42:59', '2012-11-30 22:15:48', '2012-11-30 22:39:48', '2011-11-28 10:42:55', '2011-11-28 10:49:00', '2011-11-28 14:41:54', '2011-11-28 15:04:00', '2011-11-28 20:20:59', '2011-11-29 12:10:37', '2011-11-29 12:19:00', '2011-11-29 13:25:32', '2011-11-29 15:22:57', '2011-11-29 15:49:00', '2011-11-29 16:32:23', '2011-11-30 10:27:58', '2011-11-30 10:38:58', '2011-11-30 13:05:31', '2011-11-30 14:55:24', '2011-11-30 15:11:00', '2011-11-30 16:41:09', '2011-11-30 15:08:00', '2011-12-01 11:19:43', '2011-12-01 11:29:43', '2011-12-01 14:36:38', '2011-12-02 12:31:10', '2011-12-02 12:37:10', '2011-12-02 14:50:19', '2011-12-02 15:24:00', '2011-12-02 19:40:50', '2011-12-05 12:17:58', '2011-12-05 12:26:02', '2011-12-05 15:13:55', '2011-12-05 
15:42:00', '2011-12-05 19:24:16', '2011-12-06 11:32:49', '2011-12-06 11:38:51', '2011-12-06 14:55:18', '2011-12-06 15:18:18', '2011-12-06 19:22:55', '2011-12-07 11:17:14', '2011-12-07 11:22:35', '2011-12-07 14:13:34', '2011-12-07 14:41:00', '2011-12-07 20:38:18', '2011-12-08 11:28:24', '2011-12-08 11:35:55', '2011-12-09 11:03:09', '2011-12-09 11:09:08')
etsF <- c(ets1, ets2, ets3)

eatData <- tibble(case_id=eci, 
                  activity=factor(c(ea1, ea2)), 
                  activity_instance_id=eaii, 
                  lifecycle_id=factor(elci), 
                  resource=factor("UNDEFINED"), 
                  timestamp=as.POSIXct(etsF)
                  )

eat_patterns <- eventlog(eatData,
    case_id = "case_id",
    activity_id = "activity",
    activity_instance_id = "activity_instance_id",
    lifecycle_id = "lifecycle_id",
    timestamp = "timestamp",
    resource_id = "resource")


# Create performance map
eat_patterns %>% process_map(type = performance(FUN = median, units = "hours"))
# Inspect variation in activity durations graphically
eat_patterns %>% processing_time(level = "activity") %>% plot()

# Draw dotted chart
eat_patterns %>% dotted_chart(x = "relative_day", sort = "start_day", units = "secs")
## Joining, by = "case_id"

# Time per activity
# daily_activities %>% processing_time(level = "activity") %>% plot

# Average duration of recordings
# daily_activities %>% throughput_time(level="log", units = "hours")

# Missing activities
# daily_activities %>% idle_time(level="log", units = "hours")


# Distribution throughput time
# vacancies %>% throughput_time(units="days")

# Distribution throughput time per department
# vacancies %>% group_by(vacancy_department) %>% throughput_time(units="days") %>% plot()

# Repetitions of activities
# vacancies %>% number_of_repetitions(level = "activity") %>% arrange(-relative)

Chapter 3 - Event Data Processing

Filtering cases:

  • Sometimes there are too many cases, too many activities, missing data, and the like
    • Can filter by either cases or events (time periods or specific activity types)
    • Case filters fall into three types - performance, control-flow, and time frame
  • Look at long cases for what went wrong, and short cases for what to mimic
    • filter_throughput_time(log, interval = c(5,10)) # absolute case length is 5-10 days
    • filter_throughput_time(log, percentage = 0.5) # shortest 50% of the cases
    • filter_throughput_time(log, interval = c(5,10), units = "days", reverse = TRUE) # cases that are NOT 5-10 days
    • filter_throughput_time(log, interval = c(5,NA), units = "days") # cases longer than 5 days
  • Control-flow filters can be based on activity presence/absence, timing, and the like

Filtering events - trim, frequency, label, general attribute:

  • Can trim to a time period based on start or end
    • filter_time_period(log, interval = ymd(c("20180110","20180122")), filter_method = "trim") # discards everything else
  • Can trim based on specific start and end activities
    • filter_trim(log, start_activities = "blues") # traces without blues will be discarded
    • filter_trim(log, start_activities = "blues", end_activities = "greens") # traces that do not have blues followed by greens will be discarded
    • Can set reverse=TRUE to get the opposites of these
  • Can filter by frequency at either the activity or resource level
    • filter_activity_frequency(log, interval = c(50,100))
    • filter_activity_frequency(log, percentage = 0.8)
    • filter_resource_frequency(log, interval = c(60,900))
    • filter_resource_frequency(log, percentage = 0.6)
  • Can filter by labels
    • filter_activity(log, activities = c("reds","oranges","purples"))
    • dplyr::filter(log, cost > 1000, priority == "High", …)

Aggregating events - Is-A and Part-of:

  • An is-a relationship exists when many subtypes of an activity are really all variants of one main activity
    • act_unite(log, "New name" = c("Old Variant 1","Old Variant 2","Old Variant 3"), ...) # same number of activity instances, just fewer names
  • A part-of relationship exists when clearly distinct activities can also be considered components of a higher-level activity
    • act_collapse(log, "Sub process" = c("Part 1","Part 2","Part 3"), ...) # fewer activity instances, as they are collapsed into a single activity

Enriching events - mutation (adding calculated variables):

  • dplyr::mutate() can be used to directly add calculated variables such as cost
    • log %>% group_by_case() %>% mutate(total_cost = sum(cost, na.rm = TRUE)) # group_by_case() is a function applied to event logs
    • log %>% group_by_case() %>% mutate(total_cost = sum(cost, na.rm = TRUE)) %>% mutate(impact = case_when(cost <= 1000 ~ "Low", cost <= 5000 ~ "Medium", TRUE ~ "High"))
    • log %>% group_by_case() %>% mutate(refund_made = any(str_detect(activity, "Pay Claim")))
  • Metric functions can be used directly, with append = TRUE, to both calculate the metric and add it to the event log
    • log %>% throughput_time(level = "case", units = "days", append = TRUE) %>% mutate(on_time = throughput_time_case <= 7)

Example code includes:

# Select top 20% of cases according to trace frequency
happy_path <- filter_trace_frequency(vacancies, percentage = 0.2)

# Visualize using process map
happy_path %>% process_map(type = frequency(value = "absolute_case"))

# Compute throughput time
happy_path %>% throughput_time(units="days")


# Find no_declines
no_declines <- filter_activity_presence(vacancies, activities = "Decline Candidate", reverse=TRUE)

# Find cases that contain both "Send Offer" and "Offer Accepted"
first_hit <- filter_activity_presence(vacancies, activities = c("Send Offer", "Offer Accepted"), method="all")

# Create a performance map
first_hit %>% process_map(type=performance())

# Compute throughput time
first_hit %>% throughput_time()


# Create not_refused
not_refused <- vacancies %>% filter_precedence(antecedents = "Receive Response", consequents = "Review Non Acceptance", precedence_type = "directly_follows", filter_method = "none") 

# Select longest_cases
worst_cases <- not_refused %>% filter_throughput_time(interval=c(300, NA))

# Show the different traces
worst_cases %>% trace_explorer(coverage=1)


# Select activities
disapprovals <- vacancies %>% filter_activity(activities=c("Construct Offer", "Disapprove Offer", "Revise Offer","Disapprove Revision", "Restart Procedure"))

# Explore traces
disapprovals %>% trace_explorer(coverage=0.8)

# Performance map
disapprovals %>% process_map(type = performance(FUN = sum, units = "weeks"))


# Select cases
high_paid <- vacancies %>% filter(vacancy_department=="R&D", vacancy_salary_range==">100000")

# Most active resources
high_paid %>% resource_frequency(level="resource")

# Create a dotted chart
high_paid %>% dotted_chart(x="absolute", sort="start")

# Filtered dotted chart
library(lubridate)
high_paid %>% filter_time_period(interval = ymd(c("20180321","20180620")), filter_method = "trim") %>% dotted_chart(x="absolute", sort="start")


# Count activities and instances
n_activities(vacancies)
n_activity_instances(vacancies)

# Combine activities
united_vacancies <- vacancies %>% 
    act_unite("Disapprove Contract Offer" = c("Disapprove Offer","Disapprove Revision"),
              "Approve Contract Offer" = c("Approve Offer","Approve Revision"), 
              "Construct Contract Offer" = c("Construct Offer","Revise Offer")
              )
              
# Count activities and instances
n_activities(united_vacancies)
n_activity_instances(united_vacancies)


# Aggregate sub processes
aggregated_vacancies <- act_collapse(united_vacancies, 
                            "Interviews" = c("First Interview","Second Interview","Third Interview"),
                            "Prepare Recruitment" = c("Publish Position","File Applications","Check References"),
                            "Create Offer" = c("Construct Contract Offer", "Disapprove Contract Offer", "Approve Contract Offer")
                            )

# Calculate number of activities and activity instances
n_activities(aggregated_vacancies)
n_activity_instances(aggregated_vacancies)

# Create performance map
aggregated_vacancies %>% process_map(type=performance())


# Add total_cost
vacancies_cost <- vacancies %>% 
    group_by_case() %>% 
    mutate(total_cost = sum(activity_cost, na.rm = TRUE))

# Add cost_impact (thresholds are illustrative assumptions)
vacancies_impact <- vacancies_cost %>%
    mutate(cost_impact = case_when(total_cost <= 1000 ~ "Low",
                                   total_cost <= 5000 ~ "Medium",
                                   TRUE ~ "High"))

# Compute throughput time per impact
vacancies_impact %>% group_by(cost_impact) %>% throughput_time(units = "weeks") %>% plot()


# Create cost_profile
vacancies_profile <- vacancies_impact %>%
    mutate(cost_profile = case_when(cost_impact == "High" & urgency < 7 ~ "Disproportionate",
                                    cost_impact == "Medium" & urgency < 5 ~ "Excessive",
                                    cost_impact == "Low" & urgency > 6 ~ "Lacking",
                                    TRUE ~ "Appropriate")) 

# Compare number of cases 
vacancies_profile %>% 
    group_by(cost_profile) %>%
    n_cases()
    
# Explore lacking traces
vacancies_profile %>%
  filter(cost_profile == "Lacking") %>%
  process_map()

Chapter 4 - Case Study

Preparing the event data - example includes data from Sales, Purchasing, Manufacturing, Packaging & Delivery, Accounting:

  • While all departments need to work together, it is common for each department to have different data, business rules, relational data, etc.
  • Event data must be created first, prior to running anything in the bupaR package
  • Field names ending in _at or _by may indicate timestamps and resources, respectively
  • The tidyverse tools are helpful for creating the initial data

Getting to know the process:

  • Identify data sources, transform so that each row is an event, harmonize them, create an eventlog
  • Start with high-level understanding of the process - summary(otc)
    • activity_presence(otc) %>% plot()
    • trace_length(otc) %>% plot()
    • start_activities(otc, "activity") %>% plot()
    • end_activities(otc, "activity") %>% plot()

Roles and rules:

  • Parallel activities can be run in any order, which can cause an explosion in the number of traces - collapsing can help with abstraction
  • Research questions may be related to performance, compliance, etc.
  • The “4-eyes” principle says that certain activities should not be performed by the same person
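
A 4-eyes check can be written directly with dplyr. This is a hedged sketch, not course code: the event data frame log_df, its columns case_id, activity, and resource, and the two activity names are all assumptions for illustration:

```r
library(dplyr)

# Hypothetical "4-eyes" check: for each case, count the distinct people
# who handled the two sensitive activities; a single distinct resource
# means the same person did both, violating the principle
four_eyes_violations <- log_df %>%
    filter(activity %in% c("Construct Offer", "Approve Offer")) %>%
    group_by(case_id) %>%
    summarise(n_resources = n_distinct(resource)) %>%
    filter(n_resources == 1)
```

Cases flagged this way could then be inspected further, for example with trace_explorer().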

Fast production, fast delivery:

  • Dotted charts can show the progression of cases - a request for quotation may be declined, or an offer may be sent but receive no response
  • May want to look at performance by stages (sub-groups of activities), for fairer comparisons
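
One way to get stage-level comparisons is to collapse activities into stages first, then compute processing time per collapsed "activity". A sketch, assuming stage groupings and activity names that are not from the course data:

```r
# Collapse the otc log into coarse stages (stage and activity names are
# illustrative), then compare processing time per stage rather than per
# individual activity
otc %>%
    act_collapse("Quotation" = c("Receive Quotation Request", "Send Offer"),
                 "Production" = c("Start Production", "Finish Production")) %>%
    processing_time(level = "activity", units = "days") %>%
    plot()
```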

Course recap:

  • Process maps
  • Process analytics
  • Data preprocessing
  • Analysis and use cases

Example code includes:

quotations <- readRDS("./RInputFiles/otc_quotations.RDS")

# Inspect quotations
str(quotations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1833 obs. of  17 variables:
##  $ quotation_id            : chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ cancelled_at            : chr  "2017-05-22 13:28:04" NA NA NA ...
##  $ cancelled_by            : Factor w/ 20 levels "Amy","Andrea",..: 10 NA NA NA 8 NA NA NA NA NA ...
##  $ manufactContacted_at    : chr  "2017-04-22 17:58:11" "2017-06-18 13:47:50" "2017-10-28 13:55:51" NA ...
##  $ manufactContacted_by    : Factor w/ 20 levels "Amy","Andrea",..: 11 11 11 NA NA NA 11 14 NA NA ...
##  $ received_at             : chr  "2017-04-16 20:34:12" "2017-06-09 11:19:31" "2017-10-14 18:55:47" "2017-09-08 13:29:05" ...
##  $ received_by             : Factor w/ 20 levels "Amy","Andrea",..: 2 8 8 8 8 8 10 8 2 2 ...
##  $ reminded_at             : chr  "2017-05-14 19:06:41" NA NA NA ...
##  $ reminded_by             : Factor w/ 20 levels "Amy","Andrea",..: 8 NA NA NA 8 NA 8 8 NA NA ...
##  $ send_at                 : chr  "2017-05-08 14:20:30" "2017-07-02 18:50:58" "2017-11-09 11:27:11" NA ...
##  $ send_by                 : Factor w/ 20 levels "Amy","Andrea",..: 10 2 2 NA 2 NA 2 2 NA 2 ...
##  $ supplierContacted_at    : chr  "2017-04-29 13:43:18" "2017-06-20 12:19:31" "2017-10-26 18:06:29" NA ...
##  $ supplierContacted_by    : Factor w/ 20 levels "Amy","Andrea",..: 14 11 11 NA 11 NA 11 14 NA 14 ...
##  $ supplierOfferReceived_at: chr  "2017-05-03 19:09:21" "2017-06-23 19:33:10" "2017-10-30 10:36:44" NA ...
##  $ supplierOfferReceived_by: Factor w/ 20 levels "Amy","Andrea",..: 14 11 14 NA 14 NA 14 14 NA 14 ...
##  $ warehouseContacted_at   : chr  "2017-04-24 19:36:10" "2017-06-15 19:30:07" "2017-10-22 17:57:26" NA ...
##  $ warehouseContacted_by   : Factor w/ 20 levels "Amy","Andrea",..: 11 11 11 NA 14 NA 11 14 NA 14 ...
# Create offer_history
offer_history <- quotations %>%
    gather(key, value, -quotation_id) %>%
    separate(key, into = c("activity", "info"))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Recode the key variable
offer_history <- offer_history %>%
    mutate(info = fct_recode(info,  "timestamp" = 'at',  "resource" = 'by'))

# Spread the info variable
offer_history <- offer_history %>%
    spread(info, value)


validations <- readRDS("./RInputFiles/otc_validations.RDS")

# Inspect validations
str(validations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1833 obs. of  4 variables:
##  $ quotation_id: chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ resource    : chr  "Jonathan" "Andrea" "Katherine" "Andrea" ...
##  $ started     : chr  "2017-04-17 14:59:08" "2017-06-11 13:10:45" "2017-10-16 15:59:18" "2017-09-09 17:58:39" ...
##  $ completed   : chr  "2017-04-19 18:32:57" "2017-06-13 12:18:57" "2017-10-18 16:21:56" "2017-09-12 20:58:14" ...
# Create validate_history
validate_history <- validations %>%
    mutate(
        activity = "Validate",
        action = paste(quotation_id, "validate",  sep = "-"))

# Gather the timestamp columns
validate_history <- validate_history  %>%
    gather(lifecycle, timestamp, started, completed)


# Recode the lifecycle column of validate_history
validate_history <- validate_history %>%
    mutate(lifecycle = fct_recode(lifecycle,
                "start" = "started",
                "complete" = "completed"))


# Add lifecycle and action column to offer_history
offer_history <- offer_history %>%
    mutate(
        lifecycle = "complete",
        action = paste(quotation_id, 1:n(), sep = "-"))

# Create sales_history
sales_history <- bind_rows(validate_history, offer_history)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
sales_history <- readRDS("./RInputFiles/otc_sales_history.RDS")
order_history <- readRDS("./RInputFiles/otc_order_history.RDS")
# sales_quotations <- readRDS("./RInputFiles/otc_sales_quotation.RDS")

str(sales_history)
## Classes 'tbl_df', 'tbl' and 'data.frame':    14695 obs. of  7 variables:
##  $ quotation_id  : chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ resource      : chr  "Jonathan" "Andrea" "Katherine" "Andrea" ...
##  $ activity      : chr  "Validate" "Validate" "Validate" "Validate" ...
##  $ action        : chr  "quo-1003-validate" "quo-1004-validate" "quo-1006-validate" "quo-1008-validate" ...
##  $ lifecycle     : chr  "start" "start" "start" "start" ...
##  $ timestamp     : chr  "2017-04-17 14:59:08" "2017-06-11 13:10:45" "2017-10-16 15:59:18" "2017-09-09 17:58:39" ...
##  $ sales_order_id: chr  NA "order-17-56548" "order-17-56550" NA ...
str(order_history)
## Classes 'tbl_df', 'tbl' and 'data.frame':    60804 obs. of  8 variables:
##  $ sales_order_id: chr  "order-17-56542" "order-17-56542" "order-17-56543" "order-17-56543" ...
##  $ action        : chr  "order-17-56542-0000001" "order-17-56542-0000002" "order-17-56543-0000003" "order-17-56543-0000004" ...
##  $ activity      : Factor w/ 37 levels "Assemble Order",..: 24 35 24 35 24 35 24 35 24 35 ...
##  $ resource      : Factor w/ 20 levels "Amy","Andrea",..: 10 8 2 8 2 8 10 8 2 8 ...
##  $ status        : Factor w/ 2 levels "complete","start": 2 2 2 2 2 2 2 2 2 2 ...
##  $ time          : POSIXct, format: "2017-10-17 12:37:22" "2017-10-19 15:30:40" ...
##  $ activity_cost : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ quotation_id  : chr  NA NA NA NA ...
# str(sales_quotations)

order_history <- order_history %>% 
    rename(timestamp=time, lifecycle=status) %>%
    select(-activity_cost) %>%
    mutate(activity=as.character(activity), 
           resource=as.character(resource), 
           lifecycle=as.character(lifecycle)
           )
sales_history <- sales_history %>%
    mutate(timestamp=lubridate::as_datetime(timestamp))

# sales_history <- sales_history %>% left_join(sales_quotations)
otc <- bind_rows(sales_history, order_history)


# Create the eventlog object 
otc <- otc %>%
    mutate(case_id = paste(quotation_id, sales_order_id, sep = "-")) %>%
    eventlog(
        case_id = "case_id",
        activity_id = "activity",
        activity_instance_id = "action",
        timestamp = "timestamp",
        resource_id = "resource",
        lifecycle_id = "lifecycle"
        )

# Create trace coverage graph
trace_coverage(otc, level="trace") %>% plot()

# Explore traces
otc %>%
    trace_explorer(coverage = 0.25)

# Collapse activities
otc_high_level <- act_collapse(otc, "Delivery" = c(
  "Handover To Deliverer",
  "Order Delivered",
  "Present For Collection",
  "Order Fetched")
  )

# Draw a process map
process_map(otc_high_level)
# Redraw the trace coverage graph
otc_high_level %>% trace_coverage(level="trace") %>% plot()

# Compute activity wise processing time
otc_high_level %>% processing_time(level="activity", units="days")
## # A tibble: 34 x 11
##    activity   min    q1  mean median    q3   max st_dev   iqr  total
##    <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1 Packagi~ 0      0     0      0     0      0     0     0        0 
##  2 Prepare~ 0      0     0      0     0      0     0     0        0 
##  3 Produce~ 0      0     0      0     0      0     0     0        0 
##  4 Quality~ 0      0     0      0     0      0     0     0        0 
##  5 Assembl~ 0      0     0      0     0      0     0     0        0 
##  6 Delivery 0.583  1.99  5.11   3.11  8.06  17.0   3.86  6.07 15452.
##  7 Order M~ 0      0     0      0     0      0     0     0        0 
##  8 Receive~ 0      0     0      0     0      0     0     0        0 
##  9 Receive~ 0      0     0      0     0      0     0     0        0 
## 10 Schedul~ 0      0     0      0     0      0     0     0        0 
## # ... with 24 more rows, and 1 more variable: relative_frequency <dbl>
# Plot a resource activity matrix of otc (does not work in R 3.5.3)
# otc %>% resource_frequency(level = "resource-activity") %>% plot()


# Create otc_selection
otc_selection <- otc %>% filter_activity(activities = c("Send Quotation","Send Invoice"))

# Explore traces
otc %>% trace_explorer(coverage=1)

# Draw a resource map
otc_selection %>% resource_map()
# Create otc_returned
otc_returned <- otc %>% filter_activity_presence("Return Goods")

# Compute percentage of returned orders
n_cases(otc_returned)/n_cases(otc)
## [1] 0.2130923
# Trim cases and visualize
otc_returned %>% filter_trim(start_activities="Return Goods") %>% process_map()
# Time from order to delivery
# otc %>% filter_trim(start_activities="Receive Sales Order", end_activities="Order Delivered") %>% 
#     processing_time(units="days")


# Plot processing time by type
# otc %>%
#     group_by(type) %>%
#     throughput_time() %>%
#     plot()

Network Science in R - A Tidy Approach

Chapter 1 - Hubs of the Network

Network science - examples include social networks, neural networks, etc.:

  • Nodes and edges (connections between nodes, aka “ties”) make up a network
    • In a directed network, ties have a direction (for example, followers and following)
    • In an undirected network, ties do not have a direction (for example, mutual friendship)
    • In a weighted network, the ties have an associated weight (such as bandwidth, duration of friendship, etc.)
  • Chapter will focus on the terrorism network associated with the Madrid train bombing of 2004
    • Ties include friendship, training camps, previous attacks, and other terrorists
  • The network is reflected in tidy fashion, using one data frame for nodes and another for ties
    • g <- igraph::graph_from_data_frame(d = ties, directed = FALSE, vertices = nodes)
    • V(g); vcount(g)
    • E(g); ecount(g)
  • And, then working with attributes of the network
    • g$name <- "Madrid network"; g$name
    • V(g)$id <- 1:vcount(g)
    • E(g)$weight

Visualizing networks:

  • The ggraph package can help with visualizing networks
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = weight)) + geom_node_point()
    • Much like the language of ggplot2

Centrality measures:

  • Objective is to find the most important nodes - connections among members of the network
  • Network science is a spinoff of data science, with the goal of measuring networks
  • The "degree" of a node measures the number of ties (edges) that it has
    • degree(g) # gives the number of edges per node
    • strength(g) # sums the weights of the edges per node

Example code includes:

# read the nodes file into the variable nodes
nodes <- readr::read_csv("./RInputFiles/nodes.csv")
nodes

# read the ties file into the variable ties
ties <- readr::read_csv("./RInputFiles/ties.csv")
ties


library(igraph)
library(ggraph)


# make the network from the data frame ties and print it
g <- graph_from_data_frame(ties, directed = FALSE, vertices = nodes)
g

# explore the set of nodes
V(g)

# print the number of nodes
vcount(g)

# explore the set of ties
E(g)

# print the number of ties
ecount(g)


# give the name "Madrid network" to the network and print the network `name` attribute
g$name <- "Madrid network"
g$name

# add node attribute id and print the node `id` attribute
V(g)$id <- 1:vcount(g)
V(g)$id

# print the tie `weight` attribute
E(g)$weight

# print the network and spot the attributes
g


# visualize the network with layout Kamada-Kawai
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# add an id label to nodes
ggraph(g, layout = "with_kk") +
  geom_edge_link(aes(alpha = weight)) +
  geom_node_point()  + 
  geom_node_text(aes(label = id), repel=TRUE)


# visualize the network with circular layout. Set tie transparency proportional to its weight
ggraph(g, layout = "in_circle") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# visualize the network with grid layout. Set tie transparency proportional to its weight
ggraph(g, layout = "grid") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# compute the degrees of the nodes
dgr <- degree(g)

# add the degrees to the data frame object
nodes <- mutate(nodes, degree = dgr)

# add the degrees to the network object
V(g)$degree <- dgr

# arrange the terrorists in decreasing order of degree
arrange(nodes, -degree)


# compute node strengths
stg <- strength(g)

# add strength to the data frame object using mutate
nodes <- mutate(nodes, strength = stg)

# add the variable stg to the network object as strength
V(g)$strength <- stg

# arrange terrorists in decreasing order of strength and then in decreasing order of degree
arrange(nodes, -strength, -degree)

Chapter 2 - Weakness and strength

Tie betweenness:

  • Betweenness is the number of shortest paths that go through a specific tie (edge) - removing the ties with the highest betweenness would be the most disruptive
  • In a weighted network, the shortest path is defined as the lowest sum of weights, rather than the fewest edges
    • Often need to invert the weights first, since a "high" weight usually means a close connection and thus an easy path
    • dist_weight = 1 / E(g)$weight
    • edge_betweenness(g, weights = dist_weight)

Visualizing centrality measures:

  • Visualizing betweenness can be done with the ggraph package
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = betweenness)) + geom_node_point()
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = weight)) + geom_node_point(aes(size = degree))

The strength of weak ties:

  • “The strength of weak ties” is a research paper written about network strengths
    • Argument is that the “weak ties” in a network are often the most important - relationships between diverse communities, leading to diverse ideas
    • The “strong ties” are the relationships between people who are frequently together - can lead to group-think and stasis
    • Noted that the Madrid group (and similar) tended to be highly dispersed and thus had many weak ties
    • ties %>% group_by(weight) %>% summarise(n = n(), p = n / nrow(ties)) %>% arrange(-n)

Example code includes:

# save the inverse of tie weights as dist_weight
dist_weight <- 1 / E(g)$weight

# compute weighted tie betweenness
btw <- edge_betweenness(g, weights = dist_weight)

# mutate the data frame ties adding a variable betweenness using btw
ties <- mutate(ties, betweenness=btw)

# add the tie attribute betweenness to the network
E(g)$betweenness <- btw


# join ties with nodes
ties_joined <- ties %>% 
  left_join(nodes, c("from" = "id")) %>% 
  left_join(nodes, c("to" = "id")) 

# select only relevant variables and save to ties
ties_selected <- ties_joined %>% 
  select(from, to, name_from = name.x, name_to = name.y, betweenness)

# arrange named ties in decreasing order of betweenness
arrange(ties_selected, -betweenness)


# set (alpha) proportional to weight and node size proportional to degree
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha=weight)) + 
  geom_node_point(aes(size=degree))

# produce the same visualization but set node size proportional to strength
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point(aes(size = strength))


# visualize the network with tie transparency proportional to betweenness
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness)) + 
  geom_node_point()

# add node size proportional to degree
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness)) + 
  geom_node_point(aes(size = degree))


# find median betweenness
q <- median(E(g)$betweenness)

# filter ties with betweenness larger than the median
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness, filter = (betweenness > q))) + 
  geom_node_point() + 
  theme(legend.position="none")


# find number and percentage of weak ties
ties %>%
  group_by(weight) %>%
  summarise(number = n(), percentage=n()/nrow(.)) %>%
  arrange(-number)


# build vector weakness containing TRUE for weak ties
weakness <- ties$weight == 1

# check that weakness contains the correct number of weak ties
sum(weakness)


# visualize the network by coloring the weak and strong ties
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(color = weakness)) + 
  geom_node_point()


# visualize the network with only weak ties using the filter aesthetic
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(filter=weakness), alpha = 0.5) + 
  geom_node_point()

Chapter 3 - Connection patterns

Connection patterns:

  • The adjacency matrix can be calculated using as_adjacency_matrix(g)
    • For each match of row/column, there will be a 1 for adjacency and a 0 for non-adjacency
    • Alternately, can have the weight of the tie as the entry for each row/column (with 0 as before meaning non-adjacency)
    • A <- as_adjacency_matrix(g, attr = "weight")
    • diag(A)
  • Can use the adjacency matrix to assess similarity of nodes in the matrix
    • The Pearson similarity measures the correlation between the columns in the matrix

Pearson correlation coefficient:

  • Can visualize the correlations using scatterplots
  • Can compute the correlations analytically as well
    • cor(nodes$degree, nodes$strength)

Most similar and most dissimilar terrorists:

  • Can use named graphs with weighted ties for a graphical representation of nodes and paths
  • Can use the adjacency matrix to represent the ties in a manner simplified for algebra
  • Can use the data frame format (one for nodes, and one for ties) for use with dplyr and ggplot2
    • as_data_frame(g, what = "both")
  • Can easily switch back and forth between the representations of the network
    • as_adjacency_matrix(g)
    • graph_from_adjacency_matrix(A)
    • as_data_frame(g, what = "both")
    • graph_from_data_frame(df$ties, vertices = df$nodes)
    • as_data_frame(graph_from_adjacency_matrix(A), what = "both")
    • as_adjacency_matrix(graph_from_data_frame(df$ties, vertices = df$nodes))

Example code includes:

# mutate ties data frame by swapping variables from and to 
ties_mutated <- mutate(ties, temp = to, to = from, from = temp) %>% select(-temp)

# append ties_mutated data frame to ties data frame
ties <- rbind(ties, ties_mutated)

# use a scatter plot to visualize node connection patterns in ties setting color aesthetic to weight
ggplot(ties, aes(x = from, y = to, color = factor(weight))) +
  geom_point() +
  labs(color = "weight")


# get the weighted adjacency matrix
A <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE, names = FALSE)

# print the first row and first column of A
A[1, ]
A[, 1]

# print submatrix of the first 6 rows and columns
A[1:6, 1:6]


# obtain a vector of node strengths
rowSums(A)

# build a Boolean (0/1) matrix from the weighted matrix A
B <- ifelse(A > 0, 1, 0)

# obtain a vector of node degrees using the Boolean matrix
rowSums(B)


# compute the Pearson correlation on columns of A
S <- cor(A)

# set the diagonal of S to 0
diag(S) <- 0

# print a summary of the similarities in matrix S
summary(c(S))

# plot a histogram of similarities in matrix S
hist(c(S), xlab = "Similarity", main = "Histogram of similarity")


# Scatter plot of degree and strength with regression line
ggplot(nodes, aes(x = degree, y = strength)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Pearson correlation coefficient 
cor(nodes$degree, nodes$strength)


# build weighted similarity network and save to h
h <- graph_from_adjacency_matrix(S, mode = "undirected", weighted = TRUE)

# convert the similarity network h into a similarity data frame sim_df
sim_df <- as_data_frame(h, what = "edges")

# map the similarity data frame to a tibble and save it as sim_tib
sim_tib <- as_tibble(sim_df)

# print sim_tib
sim_tib


# left join similarity and nodes data frames and then select and rename relevant variables
sim2 <- sim_tib %>% 
  left_join(nodes, c("from" = "id")) %>% 
  left_join(nodes, c("to" = "id")) %>%
  select(from, to, name_from = name.x, name_to = name.y, similarity = weight, 
         degree_from = degree.x, degree_to = degree.y, strength_from = strength.x, strength_to = strength.y)
  
# print sim2
sim2


# arrange sim2 in decreasing order of similarity. 
sim2 %>% arrange(-similarity)

# filter sim2, allowing only pairs with a degree of least 10, arrange the result in decreasing order of similarity
sim2 %>%
  filter(degree_from >= 10, degree_to >= 10) %>%
  arrange(-similarity)

# Repeat the previous steps, but in increasing order of similarity
sim2 %>%
  filter(degree_from >= 10, degree_to >= 10) %>%
  arrange(similarity)


# filter the similarity data frame to similarities larger than or equal to 0.60
sim3 <- filter(sim2, similarity >= 0.6)

# build a similarity network called h2 from the filtered similarity data frame
h2 <- graph_from_data_frame(sim3, directed = FALSE)

# visualize the similarity network h2
ggraph(h2, layout = "with_kk") + 
  geom_edge_link(aes(alpha = similarity)) + 
  geom_node_point()

Chapter 4 - Similarity Clusters

Hierarchical clustering - find clusters of similar people:

  • Basic idea is to define a measure of similarity, then match the most similar entities to groups, proceeding until there is a single cluster containing everyone
  • The dendrogram (tree diagram) is helpful for viewing this data
  • The similarity measure between individual nodes (person similarity) already exists; it needs to be extended to groups
    • Single-linkage - similarity is the maximum of the similarities of anyone in the groups
    • Complete-linkage - similarity is the minimum of the similarities of anyone in the groups
    • Average-linkage - similarity is the average of the similarities of everyone in the groups
  • The clustering algorithm works as follows
    • Evaluate similarity for all node pairs
    • Assign each node to its own group
    • Find the pair of groups with the highest similarity, and join them
    • Calculate the similarity of this newly formed group to all previously existing entities (groups or individuals)
    • Repeat until there is just a single cluster remaining
  • The R implementation is hclust()
    • D <- 1-S
    • d <- as.dist(D)
    • cc <- hclust(d, method = "average")
    • cls <- cutree(cc, k = 4)

Interactive visualizations with visNetwork:

  • visNetwork is an interactive package for viewing networks
    • Many different layouts are available, and you can interact with the nodes and the ties
    • Can select nodes and see their neighborhoods (nodes within a certain distance)
    • Can select nodes by name
    • Can partition nodes into groups and color, highlight, etc.

Wrap up:

  • Analysis of networks with measures of centrality and similarity
  • Visualization of networks, including interactivity

Example code includes:

# compute a distance matrix
D <- 1 - S

# obtain a distance object 
d <- as.dist(D)

# run average-linkage clustering method and plot the dendrogram 
cc <- hclust(d, method = "average")
plot(cc)

# find the similarity of the first pair of nodes that have been merged 
S[40, 45]


# cut the dendrogram at 4 clusters
cls <- cutree(cc, k = 4)

# add cluster information to the nodes data frame
nodes <- mutate(nodes, cluster = cls)

# print the nodes data frame
nodes


# output the names of terrorists in the first cluster
filter(nodes, cluster == 1) %>% 
    select(name)

# for each cluster select the size of the cluster, the average node degree, and the average node strength and sorts by cluster size
group_by(nodes, cluster) %>%
  summarise(size = n(), 
            avg_degree = mean(degree),
            avg_strength = mean(strength)
            ) %>%
  arrange(-size)


# add cluster information to the network 
V(g)$cluster <- nodes$cluster

# visualize the original network with colored clusters
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight), show.legend=FALSE) + 
  geom_node_point(aes(color = factor(cluster))) +
  labs(color = "cluster")

# facet the network with respect to cluster attribute
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight), show.legend=FALSE) + 
  geom_node_point(aes(color = factor(cluster))) +
  facet_nodes(~cluster, scales="free")  +
  labs(color = "cluster")


# convert igraph to visNetwork
data <- visNetwork::toVisNetworkData(g)

# print head of nodes and ties
head(data$nodes)
head(data$edges)

# visualize the network
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300)


# use the Kamada-Kawai layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk")

# use the circle layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_in_circle")

# use the grid layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_on_grid")


# highlight nearest nodes and ties of the selected node
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(highlightNearest = TRUE) 


# select nodes by id 
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(nodesIdSelection = TRUE)

# set color to cluster and generate network data
V(g)$color = V(g)$cluster
data <- visNetwork::toVisNetworkData(g)

# select by group (cluster)
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(selectedBy = "group")

Data Privacy and Anonymization in R

Chapter 1 - Introduction to Data Privacy

Intro to Anonymization - Part I:

  • Need to implement better data privacy techniques - e.g., census data, healthcare data, etc.
  • Need to have data such as individualized health, but not in a manner that identifies specific individuals
  • Topics covered in this course will include
    • Remove identifiers, synthesize data
    • Laplace mechanism for adding calibrated noise
    • Differential privacy and post-processing
    • Release of data using the above techniques
  • Data sets will include White House salaries and male infertility data
    • One basic technique is removing identifiers, such as replacing names with numbers
    • Another basic technique is to round continuous values (such as to the nearest 1000)

Intro to Anonymization - Part II:

  • Additional approaches include generalization and top/bottom coding
    • Generalization creates larger buckets of data
    • Top/bottom coding sets outliers to a pre-defined top or bottom of the range
  • Additional dplyr functions of interest
    • count() is used to find the number of observations for each distinct group
    • whitehouse %>% count(Status)
    • whitehouse %>% count(Status, Title, sort = TRUE) # sort=TRUE sorts by descending n
    • summarize_at() lets you get summary statistics for a key variable
    • whitehouse %>% summarise_at(vars(Salary), sum) # vars() holds the bare variables, while sum is the requested function
    • whitehouse %>% summarise_at(vars(Salary), funs(mean, sd)) # funs() holds the list of functions that you want to apply
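The helpers above can be tried on a small stand-in tibble (the staff data here is made up); note that funs() was soft-deprecated in favour of list() as of dplyr 0.8:

```r
library(dplyr)

# Hypothetical stand-in for the whitehouse data
staff <- tibble(
  Status = c("Employee", "Employee", "Detailee", "Employee"),
  Title  = c("ANALYST", "ANALYST", "ANALYST", "STAFF ASSISTANT"),
  Salary = c(45000, 51000, 0, 42000)
)

# count() tallies rows per distinct group; sort = TRUE orders by descending n
staff %>% count(Status, Title, sort = TRUE)

# summarise_at() applies the given function(s) to the selected variables
staff %>% summarise_at(vars(Salary), sum)
staff %>% summarise_at(vars(Salary), list(mean = mean, sd = sd))
```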

Data Synthesis:

  • Fake datasets created based on sampling from a probability distribution
  • Goal is a fake dataset (by definition anonymized) that is statistically similar to the real dataset
    • For 1/0 data, sampling from the binomial distribution can work well
    • For bell-shaped data, the normal or log-normal can often work well (though there can be issues with bounding)
    • Hard-bounding is setting values to a proper max/min, while another approach is to discard the record and sample again
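The sampling-and-resampling idea can be sketched in base R; the distribution parameters and bounds below are illustrative, not estimated from the course data:

```r
set.seed(123)
n <- 500

# Binary (1/0) variable: binomial draws with an assumed success rate
smoker <- rbinom(n, size = 1, prob = 0.3)

# Positive, bell-shaped variable: log-normal with assumed parameters
draw_salary <- function(k) rlnorm(k, meanlog = 11, sdlog = 0.4)
salary <- draw_salary(n)

# Resampling approach to bounding: redraw values that fall outside the range
out <- salary < 20000 | salary > 200000
while (any(out)) {
  salary[out] <- draw_salary(sum(out))
  out <- salary < 20000 | salary > 200000
}

# Hard-bounding alternative: clamp out-of-range values instead of redrawing
salary_clamped <- pmin(pmax(salary, 20000), 200000)
```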

Example code includes:

load("./RInputFiles/dataPriv.RData")


# Preview data
whitehouse
## # A tibble: 469 x 5
##    Name         Status  Salary Basis   Title                               
##    <chr>        <chr>    <dbl> <chr>   <chr>                               
##  1 Abrams, Ada~ Employ~  66300 Per An~ WESTERN REGIONAL COMMUNICATIONS DIR~
##  2 Adams, Ian ~ Employ~  45000 Per An~ EXECUTIVE ASSISTANT TO THE DIRECTOR~
##  3 Agnew, Davi~ Employ~  93840 Per An~ DEPUTY DIRECTOR OF INTERGOVERNMENTA~
##  4 Albino, Jam~ Employ~  91800 Per An~ SENIOR PROGRAM MANAGER              
##  5 Aldy, Jr., ~ Employ~ 130500 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT ~
##  6 Alley, Hila~ Employ~  42000 Per An~ STAFF ASSISTANT                     
##  7 Amorsingh, ~ Employ~  56092 Per An~ SPECIAL ASSISTANT                   
##  8 Anderson, A~ Employ~  60000 Per An~ SPECIAL ASSISTANT TO THE CHIEF OF S~
##  9 Anderson, C~ Employ~  51000 Per An~ POLICY ASSISTANT                    
## 10 Andrias, Ka~ Employ~ 130500 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT ~
## # ... with 459 more rows
# Set seed
set.seed(42)

# Replace names with random numbers from 1 to 1000
whitehouse_no_names <- whitehouse %>%
    mutate(Name = sample(1:1000, nrow(.), replace=FALSE))

whitehouse_no_names
## # A tibble: 469 x 5
##     Name Status   Salary Basis   Title                                     
##    <int> <chr>     <dbl> <chr>   <chr>                                     
##  1   915 Employee  66300 Per An~ WESTERN REGIONAL COMMUNICATIONS DIRECTOR  
##  2   937 Employee  45000 Per An~ EXECUTIVE ASSISTANT TO THE DIRECTOR OF SC~
##  3   286 Employee  93840 Per An~ DEPUTY DIRECTOR OF INTERGOVERNMENTAL AFFA~
##  4   828 Employee  91800 Per An~ SENIOR PROGRAM MANAGER                    
##  5   640 Employee 130500 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT FOR EN~
##  6   517 Employee  42000 Per An~ STAFF ASSISTANT                           
##  7   733 Employee  56092 Per An~ SPECIAL ASSISTANT                         
##  8   134 Employee  60000 Per An~ SPECIAL ASSISTANT TO THE CHIEF OF STAFF   
##  9   652 Employee  51000 Per An~ POLICY ASSISTANT                          
## 10   699 Employee 130500 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT AND AS~
## # ... with 459 more rows
# Rounding Salary to the nearest ten thousand
whitehouse_no_identifiers <- whitehouse_no_names %>%
    mutate(Salary = round(Salary, -4))

whitehouse_no_identifiers
## # A tibble: 469 x 5
##     Name Status   Salary Basis   Title                                     
##    <int> <chr>     <dbl> <chr>   <chr>                                     
##  1   915 Employee  70000 Per An~ WESTERN REGIONAL COMMUNICATIONS DIRECTOR  
##  2   937 Employee  40000 Per An~ EXECUTIVE ASSISTANT TO THE DIRECTOR OF SC~
##  3   286 Employee  90000 Per An~ DEPUTY DIRECTOR OF INTERGOVERNMENTAL AFFA~
##  4   828 Employee  90000 Per An~ SENIOR PROGRAM MANAGER                    
##  5   640 Employee 130000 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT FOR EN~
##  6   517 Employee  40000 Per An~ STAFF ASSISTANT                           
##  7   733 Employee  60000 Per An~ SPECIAL ASSISTANT                         
##  8   134 Employee  60000 Per An~ SPECIAL ASSISTANT TO THE CHIEF OF STAFF   
##  9   652 Employee  50000 Per An~ POLICY ASSISTANT                          
## 10   699 Employee 130000 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT AND AS~
## # ... with 459 more rows
# Convert the salaries into three categories
whitehouse.gen <- whitehouse %>%
    mutate(Salary = ifelse(Salary < 50000, 0, 
                           ifelse(Salary >= 50000 & Salary < 100000, 1, 2)))

whitehouse.gen
## # A tibble: 469 x 5
##    Name         Status  Salary Basis   Title                               
##    <chr>        <chr>    <dbl> <chr>   <chr>                               
##  1 Abrams, Ada~ Employ~      1 Per An~ WESTERN REGIONAL COMMUNICATIONS DIR~
##  2 Adams, Ian ~ Employ~      0 Per An~ EXECUTIVE ASSISTANT TO THE DIRECTOR~
##  3 Agnew, Davi~ Employ~      1 Per An~ DEPUTY DIRECTOR OF INTERGOVERNMENTA~
##  4 Albino, Jam~ Employ~      1 Per An~ SENIOR PROGRAM MANAGER              
##  5 Aldy, Jr., ~ Employ~      2 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT ~
##  6 Alley, Hila~ Employ~      0 Per An~ STAFF ASSISTANT                     
##  7 Amorsingh, ~ Employ~      1 Per An~ SPECIAL ASSISTANT                   
##  8 Anderson, A~ Employ~      1 Per An~ SPECIAL ASSISTANT TO THE CHIEF OF S~
##  9 Anderson, C~ Employ~      1 Per An~ POLICY ASSISTANT                    
## 10 Andrias, Ka~ Employ~      2 Per An~ SPECIAL ASSISTANT TO THE PRESIDENT ~
## # ... with 459 more rows
# Bottom Coding
whitehouse.bottom <- whitehouse %>%
    mutate(Salary = pmax(Salary, 45000))

# Filter Results
whitehouse.bottom %>%
    filter(Salary <= 45000)
## # A tibble: 109 x 5
##    Name          Status  Salary Basis   Title                              
##    <chr>         <chr>    <dbl> <chr>   <chr>                              
##  1 Adams, Ian H. Employ~  45000 Per An~ EXECUTIVE ASSISTANT TO THE DIRECTO~
##  2 Alley, Hilar~ Employ~  45000 Per An~ STAFF ASSISTANT                    
##  3 Asen, Jonath~ Employ~  45000 Per An~ SENIOR ANALYST                     
##  4 Ayling, Lind~ Employ~  45000 Per An~ ANALYST                            
##  5 Baggetto, Ma~ Employ~  45000 Per An~ STAFF ASSISTANT                    
##  6 Bates, Andre~ Employ~  45000 Per An~ MEDIA MONITOR                      
##  7 Belive, Laur~ Employ~  45000 Per An~ LEGISLATIVE ASSISTANT AND ASSISTAN~
##  8 Bisi, Rachel~ Employ~  45000 Per An~ LEGISLATIVE ASSISTANT              
##  9 Block, Micha~ Employ~  45000 Per An~ STAFF ASSISTANT                    
## 10 Blount, Patr~ Employ~  45000 Per An~ RECORDS MANAGEMENT ANALYST         
## # ... with 99 more rows
# View fertility data
fertility
## # A tibble: 100 x 10
##    Season   Age Child_Disease Accident_Trauma Surgical_Interv~ High_Fevers
##     <dbl> <dbl>         <int>           <int>            <int>       <int>
##  1  -0.33  0.69             0               1                1           0
##  2  -0.33  0.94             1               0                1           0
##  3  -0.33  0.5              1               0                0           0
##  4  -0.33  0.75             0               1                1           0
##  5  -0.33  0.67             1               1                0           0
##  6  -0.33  0.67             1               0                1           0
##  7  -0.33  0.67             0               0                0          -1
##  8  -0.33  1                1               1                1           0
##  9   1     0.64             0               0                1           0
## 10   1     0.61             1               0                0           0
## # ... with 90 more rows, and 4 more variables: Alcohol_Freq <dbl>,
## #   Smoking <int>, Hours_Sitting <dbl>, Diagnosis <int>
# Number of participants with Surgical_Intervention and Diagnosis
fertility %>%
    summarise_at(vars(Surgical_Intervention, Diagnosis), sum)
## # A tibble: 1 x 2
##   Surgical_Intervention Diagnosis
##                   <int>     <int>
## 1                    51        12
# Mean and Standard Deviation of Age
fertility %>%
    summarise_at(vars(Age), list(mean = mean, sd = sd))
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1 0.669 0.121
# Counts of the Groups in High_Fevers
fertility %>%
    count(High_Fevers)
## # A tibble: 3 x 2
##   High_Fevers     n
##         <int> <int>
## 1          -1     9
## 2           0    63
## 3           1    28
# Counts of the Groups in Child_Disease and Accident_Trauma
fertility %>%
    count(Child_Disease, Accident_Trauma)
## # A tibble: 4 x 3
##   Child_Disease Accident_Trauma     n
##           <int>           <int> <int>
## 1             0               0    10
## 2             0               1     3
## 3             1               0    46
## 4             1               1    41
# Find proportions
fertility %>%
    summarise_at(vars(Accident_Trauma, Surgical_Intervention), mean)
## # A tibble: 1 x 2
##   Accident_Trauma Surgical_Intervention
##             <dbl>                 <dbl>
## 1            0.44                  0.51
# Set seed
set.seed(42)

# Generate Synthetic data
accident <- rbinom(100, 1, prob=0.440)
surgical <- rbinom(100, 1, prob=0.510)


# Square root Transformation of Salary
whitehouse.salary <- whitehouse %>%
    mutate(Salary = sqrt(Salary))

# Calculate the mean and standard deviation
stats <- whitehouse.salary %>%
    summarize(mean(Salary), sd(Salary))

stats
## # A tibble: 1 x 2
##   `mean(Salary)` `sd(Salary)`
##            <dbl>        <dbl>
## 1           279.         71.8
# Generate Synthetic data
set.seed(42)
salary_transformed <- rnorm(nrow(whitehouse), mean=279, sd=71.8)

# Power transformation
salary_original <- salary_transformed ** 2

# Hard bound
salary <- ifelse(salary_original < 0, 0, salary_original)

Chapter 2 - Introduction to Differential Privacy

Differential Privacy - quantification of privacy loss via a privacy budget:

  • The worst-case scenario is assumed - no limiting assumptions are made about what a data intruder may already know
    • If an individual is from a small group, their data may be 100% available by looking at statistics in aggregate and statistics for the group that excludes them (everyone else)
  • The privacy budget is defined using epsilon - smaller numbers mean that less information will be made available
  • The general concept is to look at a dataset that includes the segment the individual is in, and a dataset that includes all other segments
    • The answer sent back to the query will have noise added to it depending on the privacy budget
  • Basically, the differential privacy algorithm finds the most “unique” person in the dataset, and then decides how much noise to add based on how identifiable they are by attribute

Global Sensitivity - usual decision-making factor for differential privacy:

  • The global sensitivity of a query is the most a variable could change based on removing one individual
    • By definition, count queries always have a global sensitivity of 1 (removing one individual changes a count by at most 1)
    • Therefore, proportion queries always have a global sensitivity of 1/n
    • Mean queries always have a global sensitivity of (max - min) / n
    • Variance queries always have a global sensitivity of (max - min)^2 / n
  • The global sensitivity and the epsilon work together to determine the amount of noise
    • Measures like median are not very sensitive to outliers, and thus very little noise needs to be added
    • Measures like maximum are very sensitive to outliers (e.g., Bill Gates’ income), and thus a large amount of noise needs to be added

Laplace Mechanism - adds noise based on the Laplace distribution with mean 0 and parameters global sensitivity and privacy budget:

  • fertility %>% summarise_at(vars(Child_Disease), sum)
  • library(smoothmest) # has function rdoublex(n, mu, lambda) - set n=1 (draws), mu=true value, lambda=globalSensitivity / epsilon
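If smoothmest is not installed, an equivalent Laplace (double-exponential) draw can be generated in base R via inverse-CDF sampling; a minimal sketch (the helper name rlaplace is my own, not from any package):

```r
# Minimal sketch: drawing Laplace noise with base R, in case smoothmest is
# unavailable. rlaplace() is a made-up helper name.
rlaplace <- function(n, mu = 0, b = 1) {
  u <- runif(n) - 0.5                      # uniform on (-0.5, 0.5)
  mu - b * sign(u) * log(1 - 2 * abs(u))   # inverse CDF of Laplace(mu, b)
}

# Equivalent in distribution to smoothmest::rdoublex(1, 51, 1/0.1)
set.seed(42)
rlaplace(1, mu = 51, b = 1 / 0.1)
```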

Example code includes:

# Number of observations
n <- nrow(fertility)

# Global sensitivity of counts
gs.count <- 1

# Global sensitivity of proportions
gs.prop <- 1/n


# Lower bound of Hours_Sitting
a <- 0

# Upper bound of Hours_Sitting
b <- 1

# Global sensitivity of mean for Hours_Sitting
gs.mean <- (b - a) / n

# Global sensitivity of proportions Hours_Sitting
gs.var <- (b - a)**2 / n


# How many participants had a Surgical_Intervention?
fertility %>%
   summarise_at(vars(Surgical_Intervention), sum)
## # A tibble: 1 x 1
##   Surgical_Intervention
##                   <int>
## 1                    51
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
eps <- 0.1
smoothmest::rdoublex(1, 51, 1/eps)
## [1] 52.98337
# Proportion of Accident_Trauma
stats <- fertility %>%
   summarise_at(vars(Accident_Trauma), mean)

stats
## # A tibble: 1 x 1
##   Accident_Trauma
##             <dbl>
## 1            0.44
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
eps <- 0.1
smoothmest::rdoublex(1, 0.440, (1/n)/eps)
## [1] 0.4598337
# Mean and Variance of Hours Sitting
fertility %>%
    summarise_at(vars(Hours_Sitting), list(mean = mean, var = var))
## # A tibble: 1 x 2
##    mean    var
##   <dbl>  <dbl>
## 1 0.407 0.0347
# Setup
set.seed(42)
eps <- 0.1

# Laplace mechanism to mean
smoothmest::rdoublex(1, 0.41, gs.mean/eps)
## [1] 0.4298337
# Laplace mechanism to variance
smoothmest::rdoublex(1, 0.03, gs.var/eps)
## [1] 0.0583491

Chapter 3 - Differentially Private Properties

Sequential Composition - method to ensure that someone cannot recover the real answer simply by sending multiple queries:

  • Idea is that the privacy budget is divided by the number of queries you plan to send
  • For example, if a query will be made for mean and another query will be made for maximum, then epsilon needs to be divided by two

Parallel Composition - method to account for queries to different parts of the database (no adjustment to epsilon needed):

  • Deciding between sequential and parallel composition comes down to whether the queries can be answered using completely disjoint (MECE) splits of the dataset

Post-processing:

  • When a new query can be answered using data that has already been privatized, a noisy answer to the new query can be synthesized from those prior results
    • The privacy budget need not be adjusted in this case
    • For example, if there are three groups, can just add noise to two of the groups and let the third group be total minus these two groups

Impossible and inconsistent answers:

  • Bounding can be introduced, such as setting all negative numbers to zero or capping anything greater than the total at the total
    • rdoublex(1, 12, gs.count / eps) %>% round() %>% max(0) # lower bound is zero
    • normalized <- (smoking/sum(smoking)) * (nrow(fertility)) # upper bound is the size of the dataset

Example code includes:

# Set Value of Epsilon
eps <- 0.1 / 2

# Number of observations
n <- nrow(fertility)

# Lower bound of Age
a <- 0

# Upper bound of Age
b <- 1

# GS of counts for Diagnosis
gs.count <- 1

# GS of mean for Age
gs.mean <- (b-a)/n


# Number of Participants with abnormal diagnosis
stats1 <- fertility %>% 
    summarize_at(vars(Diagnosis), sum)

stats1
## # A tibble: 1 x 1
##   Diagnosis
##       <int>
## 1        12
# Mean of age
stats2 <- fertility %>%
    summarize_at(vars(Age), mean)

stats2
## # A tibble: 1 x 1
##     Age
##   <dbl>
## 1 0.669
# Set seed
set.seed(42)

# Laplace mechanism to the count of abnormal diagnosis
smoothmest::rdoublex(1, 12, gs.count/eps)
## [1] 15.96674
# Laplace mechanism to the mean of age
smoothmest::rdoublex(1, 0.67, gs.mean/eps)
## [1] 0.7266982
# Set Value of Epsilon
eps <- 0.1

# Mean of Age per diagnosis level 
fertility %>%
  group_by(Diagnosis) %>%
  summarise_at(vars(Age), mean)
## # A tibble: 2 x 2
##   Diagnosis   Age
##       <int> <dbl>
## 1         0 0.664
## 2         1 0.707
# Set the seed
set.seed(42)

# Laplace mechanism to the mean age of participants with an abnormal diagnoisis
smoothmest::rdoublex(1, 0.71, gs.mean/eps)
## [1] 0.7298337
# Laplace mechanism to the mean age of participants with a normal diagnoisis
smoothmest::rdoublex(1, 0.66, gs.mean/eps)
## [1] 0.6883491
# Set Value of Epsilon
eps <- 0.5/3

# GS of Counts
gs.count <- 1

# Number of participants in each of the four seasons
fertility %>%
    count(Season)
## # A tibble: 4 x 2
##   Season     n
##    <dbl> <int>
## 1  -1       28
## 2  -0.33    37
## 3   0.33     4
## 4   1       31
# Set the seed
set.seed(42)

# Laplace mechanism to the number of participants who were evaluated in the winter, spring, and summer
winter <- smoothmest::rdoublex(1, 28, gs.count / eps) %>%
    round()

spring <- smoothmest::rdoublex(1, 37, gs.count / eps) %>%
    round()

summer <- smoothmest::rdoublex(1, 4, gs.count / eps) %>%
    round()

# Post-process based on previous queries
fall <- nrow(fertility) - winter - spring - summer


# Set Value of Epsilon
eps <- 0.01

# GS of counts
gs.count <- 1

# Number of Participants with Child_Disease
fertility %>%
    summarise_at(vars(Child_Disease), sum)
## # A tibble: 1 x 1
##   Child_Disease
##           <int>
## 1            87
# Apply the Laplace mechanism
set.seed(42)
lap_childhood <- smoothmest::rdoublex(1, 87, gs.count / eps) %>%
    round()

# Total number of observations in fertility
max_value <- nrow(fertility)

# Bound the value such that the noisy answer does not exceed the total number of observations
ifelse(lap_childhood > max_value, max_value, lap_childhood)
## [1] 100
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
fever1 <- smoothmest::rdoublex(1, 9, gs.count/eps) %>%
    max(0)
fever2 <- smoothmest::rdoublex(1, 63, gs.count/eps) %>%
    max(0)
fever3 <- smoothmest::rdoublex(1, 28, gs.count/eps) %>%
    max(0)

fever <- c(fever1, fever2, fever3)

# Normalize noise 
fever_normalized <- (fever/sum(fever)) * (nrow(fertility))

# Round the values
round(fever_normalized)
## [1] 24 76  0

Chapter 4 - Differentially Private Data Synthesis

Laplace Sanitizer - basic way to generate “noisy” categorical data:

  • Takes advantage of parallel composition - if the data can be binned or placed in a contingency table, there is no need to further divide the privacy budget
    • Since the data is queried as a histogram, the cells can be considered disjoint (non-overlapping), so parallel composition applies
  • Can generate data using rep() for a single vector

Parametric Approaches:

  • Sampling from a binomial distribution (where appropriate), with a proportion that has been privatized via the Laplace mechanism
  • Sampling from a normal or log-normal distribution (where appropriate), with a mean and variance that have been privatized via the Laplace mechanism

Wrap up:

  • Basics of anonymizing data, such as removing names
  • Basics of modifying data such as generalizing to categorical data
  • Basics of generating synthetic data using rbinom() and rnorm()
  • Basics of privacy budgets, global sensitivities, and the Laplace mechanism
  • Basics of differential privacy, such as sequential composition (split epsilon) or parallel composition (including through binning or contingency tables)
  • Basics of the Laplace sanitizer for both categorical data (rbinom) and continuous data (rnorm)
  • Next steps include managing data gaps, distortions to statistical distributions introduced by hard bounding, etc.
    • Local differential privacy (Apple) and probabilistic differential privacy (US census)
    • Techniques specific to GPS data or PCA

Example code includes:

# Set Value of Epsilon
eps <- 0.1

# GS of Counts
gs.count <- 1

# Number of participants in each season
fertility %>%
    count(Season)
## # A tibble: 4 x 2
##   Season     n
##    <dbl> <int>
## 1  -1       28
## 2  -0.33    37
## 3   0.33     4
## 4   1       31
# Set the seed
set.seed(42)

# Apply the Laplace mechanism 
winter <- smoothmest::rdoublex(1, 28, gs.count/eps) %>% max(0)
spring <- smoothmest::rdoublex(1, 37, gs.count/eps) %>% max(0)
summer <- smoothmest::rdoublex(1, 4, gs.count/eps) %>% max(0)
fall <- smoothmest::rdoublex(1, 31, gs.count/eps) %>% max(0)


# Store noisy results
seasons <- c(winter = winter, spring = spring, summer = summer, fall = fall)

# Normalizing seasons
seasons_normalized <- (seasons/sum(seasons)) * nrow(fertility)

# Round the values
round(seasons_normalized)
## winter spring summer   fall 
##     29     38      0     33
# Generate synthetic data for winter
rep(-1, 29)
##  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
## [24] -1 -1 -1 -1 -1 -1
# Generate synthetic data for spring
rep(-0.33, 38)
##  [1] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [12] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [23] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [34] -0.33 -0.33 -0.33 -0.33 -0.33
# Generate synthetic data for summer
rep(0.33, 0)
## numeric(0)
# Generate synthetic data for fall
rep(1, 33)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Calculate proportions
fertility %>%
    summarise_at(vars(Accident_Trauma, Surgical_Intervention), mean)
## # A tibble: 1 x 2
##   Accident_Trauma Surgical_Intervention
##             <dbl>                 <dbl>
## 1            0.44                  0.51
# Number of Observations
n <- nrow(fertility)

# Set Value of Epsilon
eps <- 0.1

# GS of Proportion
gs.prop <- (1/n)


# Apply the Laplace mechanism
set.seed(42)
smoothmest::rdoublex(1, 0.44, gs.prop/eps)
## [1] 0.4598337
smoothmest::rdoublex(1, 0.51, gs.prop/eps)
## [1] 0.5383491
# Generate Synthetic data
set.seed(42)
accident <- rbinom(n, 1, 0.46)
surgical <- rbinom(n, 1, 0.54)


# Set Value of Epsilon
eps <- 0.1 / 2

# Number of observations
n <- nrow(fertility)

# Upper and lower bounds of age
a <- 0
b <- 1

# GS of mean and variance for age
gs.mean <- (b-a) / n
gs.var <- (b-a)**2 / n


# Mean and Variance of Age
fertility %>%
    summarise_at(vars(Age), list(mean = mean, var = var))
## # A tibble: 1 x 2
##    mean    var
##   <dbl>  <dbl>
## 1 0.669 0.0147
# Apply the Laplace mechanism
set.seed(42)
smoothmest::rdoublex(1, 0.67, gs.mean/eps)
## [1] 0.7096674
smoothmest::rdoublex(1, 0.01, gs.var/eps)
## [1] 0.06669821
# Generate Synthetic data
set.seed(42)
age <- rnorm(n, mean=0.71, sd=sqrt(0.07))

# Hard Bounding the data
age[age < 0] <- 0
age[age > 1] <- 1

Marketing Analytics in R: Statistical Modeling

Chapter 1 - Modeling Customer Lifetime Value with Linear Regression

Introduction - Verena from INWT Statistics (consultancy in marketing analytics):

  • Customer Lifetime Value (CLV) is the forecasted value a customer is expected to generate for the company
    • CLV is based on margin, and needs to use current information to predict future margins
    • Customers predicted to have higher CLV can then be targeted
  • Can inspect the data without seeing attributes using str(clvData1, give.attr = FALSE)
  • Can derive correlations using corrplot
    • library(corrplot)
    • clvData1 %>% select(nOrders, nItems, … ,margin, futureMargin) %>% cor() %>% corrplot()

Simple linear regression - one predictor variable to predict one response variable:

  • Can run linear regressions using basic stats modules
    • simpleLM <- lm(futureMargin ~ margin, data = clvData1)
    • summary(simpleLM)
  • Can plot previous margin vs. current margin, including a linear regression (smooth)
    • ggplot(clvData1, aes(margin, futureMargin)) + geom_point() + geom_smooth(method = lm, se = FALSE) + xlab("Margin year 1") + ylab("Margin year 2")
  • Several conditions must apply for linear regression to be the best method
    • Linear relationship between x and y
    • No measurement error in x (weak exogeneity)
    • Independence of errors
    • Expectation of errors is 0
    • Constant variance of prediction errors (homoscedasticity)
    • Normality of errors
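Several of these conditions can be checked with the diagnostics built into lm; a minimal sketch on simulated data (margin and futureMargin below are simulated stand-ins, not the course’s clvData1):

```r
# Minimal sketch: checking linear-regression assumptions on simulated data.
# These variables imitate clvData1's margin/futureMargin but are made up here.
set.seed(1)
margin       <- rnorm(200, mean = 40, sd = 10)
futureMargin <- 5 + 0.8 * margin + rnorm(200, sd = 4)
fit <- lm(futureMargin ~ margin)

plot(fit, which = 1)  # residuals vs fitted: linearity, constant variance
plot(fit, which = 2)  # normal Q-Q plot: normality of errors

# With an intercept, OLS residuals average to zero and are uncorrelated
# with the predictor by construction
mean(residuals(fit))
cor(residuals(fit), margin)
```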

Multiple linear regression:

  • Omitted variable bias is when a variable not in the regression is correlated with both the predictor and the response variables
    • Simpson’s Paradox is an example - upward sloping becomes downward sloping after properly splitting on the extra variable
  • Multicollinearity is a threat to a linear regression - leads to unstable regression coefficients, with associated under-reporting of standard errors
    • rms::vif(myLMModel) # above 5 is concerning, above 10 almost always needs to be addressed

Model validation, fit, and prediction:

  • The R-squared is the proportion of variance in the dependent variable that is explained by the regression
  • Can look at the p-value of the F-test to assess the overall statistical significance of the model
  • There is a risk of over-fitting, when the model is overly complex and learns artifacts of the training data rather than genuine patterns
    • Can use stats::AIC() or MASS::stepAIC(), with the goal being to minimize AIC (needs to be models of the same data)
    • AIC(multipleLM2)
  • Can predict outputs automatically, such as with
    • predMargin <- predict(multipleLM2, newdata = clvData2)

Example code includes:

salesData <- readr::read_csv("./RInputFiles/salesData.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   nItems = col_double(),
##   mostFreqStore = col_character(),
##   mostFreqCat = col_character(),
##   nCats = col_double(),
##   preferredBrand = col_character(),
##   nBrands = col_double(),
##   nPurch = col_double(),
##   salesLast3Mon = col_double(),
##   salesThisMon = col_double(),
##   daysSinceLastPurch = col_double(),
##   meanItemPrice = col_double(),
##   meanShoppingCartValue = col_double(),
##   customerDuration = col_double()
## )
# Structure of dataset
str(salesData, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5122 obs. of  14 variables:
##  $ id                   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ nItems               : num  1469 1463 262 293 108 ...
##  $ mostFreqStore        : chr  "Stockton" "Stockton" "Colorado Springs" "Colorado Springs" ...
##  $ mostFreqCat          : chr  "Alcohol" "Alcohol" "Shoes" "Bakery" ...
##  $ nCats                : num  72 73 55 50 32 41 36 31 41 52 ...
##  $ preferredBrand       : chr  "Veina" "Veina" "Bo" "Veina" ...
##  $ nBrands              : num  517 482 126 108 79 98 78 62 99 103 ...
##  $ nPurch               : num  82 88 56 43 18 35 34 12 26 33 ...
##  $ salesLast3Mon        : num  2742 2791 1530 1766 1180 ...
##  $ salesThisMon         : num  1284 1243 683 730 553 ...
##  $ daysSinceLastPurch   : num  1 1 1 1 12 2 2 4 14 1 ...
##  $ meanItemPrice        : num  1.87 1.91 5.84 6.03 10.93 ...
##  $ meanShoppingCartValue: num  33.4 31.7 27.3 41.1 65.6 ...
##  $ customerDuration     : num  821 657 548 596 603 673 612 517 709 480 ...
# Visualization of correlations
salesData %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor() %>%
  corrplot::corrplot()

# Frequent stores
ggplot(salesData) +
    geom_boxplot(aes(x = mostFreqStore, y = salesThisMon))

# Preferred brand
ggplot(salesData) +
    geom_boxplot(aes(x = preferredBrand, y = salesThisMon))

# Model specification using lm
salesSimpleModel <- lm(salesThisMon ~ salesLast3Mon, data = salesData)

# Looking at model summary
summary(salesSimpleModel)
## 
## Call:
## lm(formula = salesThisMon ~ salesLast3Mon, data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.18  -68.26    3.21   72.98  605.58 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.690501   6.083886   16.39   <2e-16 ***
## salesLast3Mon  0.382696   0.004429   86.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.5 on 5120 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5931 
## F-statistic:  7465 on 1 and 5120 DF,  p-value: < 2.2e-16
# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . -id, data = salesData)

# Checking variance inflation factors
car::vif(salesModel1)
##                            GVIF Df GVIF^(1/(2*Df))
## nItems                11.772600  1        3.431122
## mostFreqStore          1.260469  9        1.012943
## mostFreqCat            1.527348  9        1.023809
## nCats                  8.402073  1        2.898633
## preferredBrand         1.682184  9        1.029316
## nBrands               14.150868  1        3.761764
## nPurch                 3.083952  1        1.756119
## salesLast3Mon          8.697663  1        2.949180
## daysSinceLastPurch     1.585057  1        1.258991
## meanItemPrice          1.987665  1        1.409846
## meanShoppingCartValue  2.247579  1        1.499193
## customerDuration       1.004664  1        1.002329
# Estimating new model by removing information on brand
salesModel2 <- lm(salesThisMon ~ . -id -preferredBrand -nBrands, data = salesData)

# Checking variance inflation factors
car::vif(salesModel2)
##                           GVIF Df GVIF^(1/(2*Df))
## nItems                6.987456  1        2.643380
## mostFreqStore         1.178251  9        1.009154
## mostFreqCat           1.269636  9        1.013351
## nCats                 5.813494  1        2.411119
## nPurch                3.069046  1        1.751869
## salesLast3Mon         8.412520  1        2.900435
## daysSinceLastPurch    1.579426  1        1.256752
## meanItemPrice         1.925494  1        1.387622
## meanShoppingCartValue 2.238410  1        1.496132
## customerDuration      1.002981  1        1.001489
salesData2_4 <- readr::read_csv("./RInputFiles/salesDataMon2To4.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   nItems = col_double(),
##   mostFreqStore = col_character(),
##   mostFreqCat = col_character(),
##   nCats = col_double(),
##   preferredBrand = col_character(),
##   nBrands = col_double(),
##   nPurch = col_double(),
##   salesLast3Mon = col_double(),
##   daysSinceLastPurch = col_double(),
##   meanItemPrice = col_double(),
##   meanShoppingCartValue = col_double(),
##   customerDuration = col_double()
## )
# getting an overview of new data
summary(salesData2_4)
##        id           nItems       mostFreqStore      mostFreqCat       
##  Min.   :   1   Min.   :   1.0   Length:5173        Length:5173       
##  1st Qu.:1372   1st Qu.:  84.0   Class :character   Class :character  
##  Median :2733   Median : 155.0   Mode  :character   Mode  :character  
##  Mean   :2729   Mean   : 185.9                                        
##  3rd Qu.:4085   3rd Qu.: 257.0                                        
##  Max.   :5455   Max.   :1461.0                                        
##      nCats       preferredBrand        nBrands           nPurch     
##  Min.   : 1.00   Length:5173        Min.   :  1.00   Min.   : 1.00  
##  1st Qu.:27.00   Class :character   1st Qu.: 45.00   1st Qu.:11.00  
##  Median :37.00   Mode  :character   Median : 75.00   Median :17.00  
##  Mean   :36.23                      Mean   : 81.66   Mean   :20.02  
##  3rd Qu.:46.00                      3rd Qu.:110.00   3rd Qu.:27.00  
##  Max.   :74.00                      Max.   :484.00   Max.   :86.00  
##  salesLast3Mon  daysSinceLastPurch meanItemPrice     meanShoppingCartValue
##  Min.   : 189   Min.   : 1.000     Min.   :  1.879   Min.   :  17.58      
##  1st Qu.:1068   1st Qu.: 2.000     1st Qu.:  6.049   1st Qu.:  53.88      
##  Median :1331   Median : 4.000     Median :  8.556   Median :  75.77      
##  Mean   :1324   Mean   : 6.589     Mean   : 12.116   Mean   :  91.88      
##  3rd Qu.:1570   3rd Qu.: 7.000     3rd Qu.: 12.969   3rd Qu.: 109.74      
##  Max.   :2745   Max.   :87.000     Max.   :313.050   Max.   :1147.66      
##  customerDuration
##  Min.   :  31.0  
##  1st Qu.: 580.0  
##  Median : 682.0  
##  Mean   : 676.8  
##  3rd Qu.: 777.0  
##  Max.   :1386.0
# predicting sales
predSales5 <- predict(salesModel2, newdata = salesData2_4)

# calculating mean of future sales
mean(predSales5)
## [1] 625.1438

Chapter 2 - Logistic Regression for Churn Prevention

Churn prevention in online marketing:

  • Objective is to predict the likelihood of a customer returning (vs. churning), assessed using logistic regression
    • Model the log-odds (defined as log (P(Y=1) / P(Y=0))) as a linear function of the inputs
    • Convert the log-odds to odds (defined as P(Y=1) / P(Y=0)) by exponentiation
    • Convert the odds to a probability of churning, using odds / (1 + odds)
  • Can begin with basic data exploration
    • ggplot(churnData, aes(x = returnCustomer)) + geom_histogram(stat = "count")
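The log-odds conversions above can be sketched directly in base R (the intercept and coefficient values below are invented for illustration):

```r
# Minimal sketch: from log-odds to odds to probability.
# The intercept and coefficient are made-up illustration values.
log_odds <- -1.2 + 0.8 * 1        # linear predictor for one customer
odds     <- exp(log_odds)         # odds = P(Y=1) / P(Y=0)
prob     <- odds / (1 + odds)     # probability of the event

# Base R's logistic CDF performs the same conversion in one step
all.equal(prob, plogis(log_odds))
```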

Modeling and model selection:

  • The logit model can be run using the GLM provided in R
    • logitModelFull <- glm(returnCustomer ~ title + newsletter + websiteDesign + …, family = binomial, churnData)
  • Interpreting the coefficients is not easy - they are related to the log-odds
    • Can exponentiate the coefficients to get their impact on the odds
    • Can then interpret that greater than 1 means “more likely, all else equal”
  • Can use MASS::stepAIC() to help refine the modeling
    • library(MASS)
    • logitModelNew <- stepAIC(logitModelFull, trace = 0)
    • summary(logitModelNew)
    • Produces a model with fewer variables and a lower AIC

In-sample model fit and thresholding:

  • There are three types of pseudo-R-squared statistics available for the results of logistic regression
    • McFadden: R-squared = 1 - ln(L(full)) / ln(L(null))
    • Cox-Snell: R-squared = 1 - (L(null) / L(full)) ** (2/n)
    • Nagelkerke: R-squared = [1 - (L(null) / L(full)) ** (2/n)] / [1 - L(null) ** (2/n)]
    • Generally, anything above 0.2 is reasonably good
    • descr::LogRegR2(logitModelNew)
    • library(SDMTools)
    • churnData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude) # get the prediction probabilities
    • churnData %>% select(returnCustomer, predNew) %>% tail()
    • confMatrixNew <- confusion.matrix(churnData$returnCustomer, churnData$predNew, threshold = 0.5) # this is the version from SDMTools
  • Can give different weights to the different errors (false negatives, false positives, etc.)
    • Can instead look at a payoff, defined based on scalars for the various quadrants
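A minimal base-R sketch of McFadden’s pseudo-R-squared and a thresholded confusion matrix, on simulated data (so it avoids the descr and SDMTools dependencies; the churn-like data here is made up):

```r
# Minimal sketch: McFadden pseudo-R-squared and a 0.5-threshold confusion
# matrix, computed with base R on simulated (not real churn) data.
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-0.5 + 1.5 * x))

full <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

# McFadden: 1 - logLik(full) / logLik(null), using log-likelihoods
mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))

# Confusion matrix and accuracy at threshold 0.5
pred <- as.integer(predict(full, type = "response") > 0.5)
conf <- table(observed = y, predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)
```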

Out-of-sample validation and cross validation:

  • Begin by randomly splitting data into training (roughly two-thirds) and holding back the remainder for validation (roughly one-third)
    • set.seed(534381)
    • churnData$isTrain <- rbinom(nrow(churnData), 1, 0.66)
    • train <- subset(churnData, churnData$isTrain == 1)
    • test <- subset(churnData, churnData$isTrain == 0)
    • test$predNew <- predict(logitTrainNew, type = "response", newdata = test) # make predictions only on the test dataset
  • Cross-validation is an even more powerful tool for assessing out-of-sample error
    • Split the data into k subsets, and run the model k times, each time training on k-1 subsets and validating on the held-out subset
    • Acc03 <- function(r, pi = 0) {
    •     cm <- confusion.matrix(r, pi, threshold = 0.3)
    •     acc <- sum(diag(cm)) / sum(cm)
    •     return(acc)
    • }
    • set.seed(534381)
    • boot::cv.glm(churnData, logitModelNew, cost = Acc03, K = 6)$delta
  • Can continually tweak the model to see if transforms, variable additions, etc., might tend to improve the out-of-sample error rate
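The k-fold loop can also be written out by hand; a minimal base-R sketch on simulated data (avoids boot and SDMTools; the data are made up, and the 0.3 threshold mirrors the Acc03 idea in the notes):

```r
# Minimal sketch: manual 6-fold cross-validation of a logistic model's
# accuracy at threshold 0.3, on simulated (not churnData) data.
set.seed(534381)
n   <- 600
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x))

k    <- 6
fold <- sample(rep(1:k, length.out = n))  # random fold assignment

acc <- sapply(1:k, function(i) {
  fit <- glm(y ~ x, family = binomial, data = dat[fold != i, ])
  p   <- predict(fit, newdata = dat[fold == i, ], type = "response")
  mean((p > 0.3) == dat$y[fold == i])     # accuracy at threshold 0.3
})

mean(acc)  # cross-validated accuracy estimate
```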

Example code includes:

defaultData <- readr::read_delim("./RInputFiles/defaultData.csv", delim=";")
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
# Summary of data
summary(defaultData)
##        ID           limitBal            sex          education    
##  Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000  
##  1st Qu.: 4501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 9000   Median : 130000   Median :2.000   Median :2.000  
##  Mean   : 9000   Mean   : 162902   Mean   :1.588   Mean   :1.835  
##  3rd Qu.:13500   3rd Qu.: 230000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :18000   Max.   :1000000   Max.   :2.000   Max.   :6.000  
##     marriage         age             pay1               pay2        
##  Min.   :0.00   Min.   :21.00   Min.   :-2.00000   Min.   :-2.0000  
##  1st Qu.:1.00   1st Qu.:28.00   1st Qu.:-1.00000   1st Qu.:-1.0000  
##  Median :2.00   Median :34.00   Median : 0.00000   Median : 0.0000  
##  Mean   :1.56   Mean   :35.48   Mean   : 0.02783   Mean   :-0.1017  
##  3rd Qu.:2.00   3rd Qu.:41.00   3rd Qu.: 0.00000   3rd Qu.: 0.0000  
##  Max.   :3.00   Max.   :75.00   Max.   : 8.00000   Max.   : 8.0000  
##       pay3              pay4              pay5             pay6        
##  Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.000   Min.   :-2.0000  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.000   1st Qu.:-1.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   :-0.1294   Mean   :-0.1974   Mean   :-0.228   Mean   :-0.2567  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.000   3rd Qu.: 0.0000  
##  Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.000   Max.   : 8.0000  
##     billAmt1          billAmt2         billAmt3          billAmt4      
##  Min.   :-165580   Min.   :-33350   Min.   : -34041   Min.   :-170000  
##  1st Qu.:   3675   1st Qu.:  3149   1st Qu.:   2655   1st Qu.:   2245  
##  Median :  22450   Median : 21425   Median :  20035   Median :  18703  
##  Mean   :  50030   Mean   : 48131   Mean   :  45607   Mean   :  41074  
##  3rd Qu.:  65001   3rd Qu.: 62157   3rd Qu.:  58457   3rd Qu.:  50540  
##  Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586  
##     billAmt5         billAmt6          payAmt1          payAmt2       
##  Min.   :-37594   Min.   :-339603   Min.   :     0   Min.   :      0  
##  1st Qu.:  1684   1st Qu.:   1150   1st Qu.:   949   1st Qu.:    696  
##  Median : 18046   Median :  16780   Median :  2087   Median :   2000  
##  Mean   : 39398   Mean   :  38009   Mean   :  5532   Mean   :   5731  
##  3rd Qu.: 49355   3rd Qu.:  48442   3rd Qu.:  5000   3rd Qu.:   5000  
##  Max.   :927171   Max.   : 961664   Max.   :505000   Max.   :1684259  
##     payAmt3          payAmt4          payAmt5            payAmt6      
##  Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0  
##  1st Qu.:   307   1st Qu.:   228   1st Qu.:   209.8   1st Qu.:     2  
##  Median :  1500   Median :  1486   Median :  1500.0   Median :  1400  
##  Mean   :  4629   Mean   :  4757   Mean   :  4763.7   Mean   :  5135  
##  3rd Qu.:  4000   3rd Qu.:  4000   3rd Qu.:  4000.0   3rd Qu.:  4000  
##  Max.   :896040   Max.   :497000   Max.   :417990.0   Max.   :528666  
##  PaymentDefault  
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2306  
##  3rd Qu.:0.0000  
##  Max.   :1.0000
# Look at data structure
str(defaultData, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 18000 obs. of  25 variables:
##  $ ID            : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ limitBal      : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ sex           : num  2 2 2 2 1 1 1 2 2 1 ...
##  $ education     : num  2 2 2 2 2 1 1 2 3 3 ...
##  $ marriage      : num  1 2 2 1 1 2 2 2 1 2 ...
##  $ age           : num  24 26 34 37 57 37 29 23 28 35 ...
##  $ pay1          : num  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ pay2          : num  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ pay3          : num  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ pay4          : num  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ pay5          : num  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ pay6          : num  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ billAmt1      : num  3913 2682 29239 46990 8617 ...
##  $ billAmt2      : num  3102 1725 14027 48233 5670 ...
##  $ billAmt3      : num  689 2682 13559 49291 35835 ...
##  $ billAmt4      : num  0 3272 14331 28314 20940 ...
##  $ billAmt5      : num  0 3455 14948 28959 19146 ...
##  $ billAmt6      : num  0 3261 15549 29547 19131 ...
##  $ payAmt1       : num  0 0 1518 2000 2000 ...
##  $ payAmt2       : num  689 1000 1500 2019 36681 ...
##  $ payAmt3       : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ payAmt4       : num  0 1000 1000 1100 9000 ...
##  $ payAmt5       : num  0 0 1000 1069 689 ...
##  $ payAmt6       : num  0 2000 5000 1000 679 ...
##  $ PaymentDefault: num  1 1 0 0 0 0 0 0 0 0 ...
# Analyze the balancedness of dependent variable
ggplot(defaultData, aes(x = PaymentDefault)) +
  geom_histogram(stat = "count") 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                   age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
                   billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
                   payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, 
                family = "binomial", data = defaultData)

# Take a look at the model
summary(logitModelFull)
## 
## Call:
## glm(formula = PaymentDefault ~ limitBal + sex + education + marriage + 
##     age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
##     billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
##     payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, family = "binomial", 
##     data = defaultData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0893  -0.7116  -0.5615  -0.2794   4.2501  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.711e-01  1.505e-01  -3.795 0.000148 ***
## limitBal    -4.825e-07  1.985e-07  -2.431 0.015052 *  
## sex         -8.251e-02  3.880e-02  -2.127 0.033457 *  
## education   -1.217e-01  2.745e-02  -4.434 9.23e-06 ***
## marriage    -1.711e-01  4.016e-02  -4.259 2.05e-05 ***
## age          4.824e-03  2.257e-03   2.137 0.032570 *  
## pay1         5.743e-01  2.221e-02  25.864  < 2e-16 ***
## pay2         5.156e-02  2.552e-02   2.020 0.043336 *  
## pay3         7.811e-02  2.863e-02   2.728 0.006375 ** 
## pay4        -1.191e-02  3.285e-02  -0.363 0.716838    
## pay5         1.080e-01  3.381e-02   3.193 0.001406 ** 
## pay6        -1.956e-02  2.750e-02  -0.711 0.476852    
## billAmt1    -7.948e-06  1.582e-06  -5.023 5.09e-07 ***
## billAmt2     4.911e-06  2.006e-06   2.448 0.014350 *  
## billAmt3     4.203e-07  1.698e-06   0.247 0.804572    
## billAmt4    -1.587e-08  1.872e-06  -0.008 0.993234    
## billAmt5     9.703e-07  2.154e-06   0.451 0.652293    
## billAmt6     6.758e-07  1.591e-06   0.425 0.670955    
## payAmt1     -1.878e-05  3.252e-06  -5.777 7.61e-09 ***
## payAmt2     -6.406e-06  2.364e-06  -2.710 0.006731 ** 
## payAmt3     -3.325e-06  2.401e-06  -1.385 0.166153    
## payAmt4     -3.922e-06  2.342e-06  -1.675 0.093970 .  
## payAmt5     -2.383e-06  2.168e-06  -1.099 0.271635    
## payAmt6     -1.916e-06  1.618e-06  -1.184 0.236521    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19438  on 17999  degrees of freedom
## Residual deviance: 17216  on 17976  degrees of freedom
## AIC: 17264
## 
## Number of Fisher Scoring iterations: 5
# Take a look at the odds
coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp
## (Intercept)    limitBal         sex   education    marriage         age 
##        0.56        1.00        0.92        0.89        0.84        1.00 
##        pay1        pay2        pay3        pay4        pay5        pay6 
##        1.78        1.05        1.08        0.99        1.11        0.98 
##    billAmt1    billAmt2    billAmt3    billAmt4    billAmt5    billAmt6 
##        1.00        1.00        1.00        1.00        1.00        1.00 
##     payAmt1     payAmt2     payAmt3     payAmt4     payAmt5     payAmt6 
##        1.00        1.00        1.00        1.00        1.00        1.00
# The old (full) model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                   age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
                   billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
                   payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, 
                 family = binomial, defaultData)

#Build the new model
logitModelNew <- MASS::stepAIC(logitModelFull, trace=0) 

#Look at the model
summary(logitModelNew) 
## 
## Call:
## glm(formula = PaymentDefault ~ limitBal + sex + education + marriage + 
##     age + pay1 + pay2 + pay3 + pay5 + billAmt1 + billAmt2 + billAmt5 + 
##     payAmt1 + payAmt2 + payAmt3 + payAmt4, family = binomial, 
##     data = defaultData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0839  -0.7119  -0.5611  -0.2839   4.1800  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.699e-01  1.504e-01  -3.790 0.000151 ***
## limitBal    -5.201e-07  1.954e-07  -2.661 0.007791 ** 
## sex         -8.206e-02  3.878e-02  -2.116 0.034338 *  
## education   -1.212e-01  2.744e-02  -4.418 9.96e-06 ***
## marriage    -1.724e-01  4.014e-02  -4.295 1.75e-05 ***
## age          4.863e-03  2.256e-03   2.156 0.031092 *  
## pay1         5.740e-01  2.218e-02  25.882  < 2e-16 ***
## pay2         4.979e-02  2.552e-02   1.951 0.051048 .  
## pay3         7.197e-02  2.573e-02   2.798 0.005146 ** 
## pay5         8.859e-02  2.249e-02   3.938 8.20e-05 ***
## billAmt1    -8.130e-06  1.580e-06  -5.144 2.69e-07 ***
## billAmt2     5.238e-06  1.775e-06   2.951 0.003165 ** 
## billAmt5     1.790e-06  8.782e-07   2.038 0.041554 *  
## payAmt1     -1.931e-05  3.258e-06  -5.928 3.06e-09 ***
## payAmt2     -6.572e-06  2.092e-06  -3.142 0.001681 ** 
## payAmt3     -3.693e-06  2.187e-06  -1.689 0.091241 .  
## payAmt4     -4.611e-06  2.062e-06  -2.237 0.025306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19438  on 17999  degrees of freedom
## Residual deviance: 17220  on 17983  degrees of freedom
## AIC: 17254
## 
## Number of Fisher Scoring iterations: 5
# Save the formula of the new model (it will be needed for the out-of-sample part) 
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
## PaymentDefault ~ limitBal + sex + education + marriage + age + 
##     pay1 + pay2 + pay3 + pay5 + billAmt1 + billAmt2 + billAmt5 + 
##     payAmt1 + payAmt2 + payAmt3 + payAmt4
# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                                  defaultData$predFull, 
                                                  threshold = 0.5
                                                  )
confMatrixModelFull
##     obs
## pred     0    1
##    0 13441 3154
##    1   409  996
## attr(,"class")
## [1] "confusion.matrix"
# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
## [1] 0.8020556
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                                 defaultData$predNew, 
                                                 threshold = 0.5
                                                 )
confMatrixModelNew
##     obs
## pred     0    1
##    0 13443 3152
##    1   407  998
## attr(,"class")
## [1] "confusion.matrix"
# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew
## [1] 0.8022778
# and compare it to the full model's accuracy
accuracyFull
## [1] 0.8020556
accuracyNew
## [1] 0.8022778
# Prepare data frame with threshold values and empty payoff column
payoffMatrix <- data.frame(threshold = seq(from = 0.1, to = 0.5, by = 0.1), payoff = NA) 
payoffMatrix
##   threshold payoff
## 1       0.1     NA
## 2       0.2     NA
## 3       0.3     NA
## 4       0.4     NA
## 5       0.5     NA
for(i in 1:length(payoffMatrix$threshold)) {
  # Calculate confusion matrix with varying threshold
  confMatrix <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                           defaultData$predNew, 
                                           threshold = payoffMatrix$threshold[i]
                                           )
  # Calculate payoff and save it to the corresponding row
  payoffMatrix$payoff[i] <- confMatrix[1, 1]*250 + confMatrix[1, 2]*(-1000)
}
payoffMatrix
##   threshold payoff
## 1       0.1 306750
## 2       0.2 752750
## 3       0.3 888000
## 4       0.4 641250
## 5       0.5 208750
# Split data in train and test set
set.seed(534381) 
defaultData$isTrain <- rbinom(nrow(defaultData), 1, 0.66)
train <- subset(defaultData, isTrain == 1)
test <- subset(defaultData, isTrain  == 0)

logitTrainNew <- glm(formulaLogit, family = binomial, data = train) # Modeling
test$predNew <- predict(logitTrainNew, type = "response", newdata = test) # Predictions

# Out-of-sample confusion matrix and accuracy
confMatrixModelNew <- SDMTools::confusion.matrix(test$PaymentDefault, test$predNew, threshold = 0.3) 
sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew) # Compare this value to the in-sample accuracy
## [1] 0.7797764
# Accuracy function
costAcc <- function(r, pi = 0) {
  cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

# Cross validated accuracy for logitModelNew
set.seed(534381)
boot::cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta[1]
## [1] 0.7862778

Chapter 3 - Modeling Time to Reorder with Survival Analysis

Survival Analysis Introduction:

  • Often have “censored” data, meaning that the customer journeys are not yet complete
    • Random Type I Right censoring is the most common - a point can only be observed if it has occurred before time X, and it is otherwise unknowable (but known that they have not yet churned)
    • Can plot histograms of whether someone has churned depending on the length of time
    • plotTenure <- dataSurv %>% mutate(churn = churn %>% factor(labels = c("No", "Yes"))) %>%
    • ggplot() + geom_histogram(aes(x = tenure, fill = factor(churn))) + facet_grid( ~ churn) +
    • theme(legend.position = "none")
  • Survival analysis attempts to estimate when something will happen (churn, second order, renewal, etc.)
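The censoring idea above can be made concrete with a minimal base-R sketch (the variables `trueTime`, `cutoff`, etc. are made up here, not from the course's `dataSurv`; the `survival` package ships with R):

```r
# Minimal sketch of random Type I right censoring (hypothetical data):
# an event is observed only if it occurs before the end of the
# observation window; otherwise it is censored at that point.
library(survival)

set.seed(42)
trueTime <- rexp(100, rate = 0.1)           # latent time-to-churn
cutoff   <- 12                              # end of observation window
observed <- pmin(trueTime, cutoff)          # the time we actually record
churned  <- as.numeric(trueTime <= cutoff)  # 1 = event seen, 0 = censored

survObj <- Surv(observed, churned)          # censored times print with a "+"
head(survObj)
```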

Survival curve analysis by Kaplan-Meier:

  • Begin by creating a new object containing the survival attribute
    • cbind(dataSurv %>% select(tenure, churn), surv = Surv(dataSurv$tenure, dataSurv$churn)) %>% head(10)
  • The survival function S(t) is the probability of "no event" having occurred by time t
    • The cumulative probability of the "event" by time t is its complement, 1 - S(t)
    • The "hazard rate" is the instantaneous probability of the event happening in a small time interval, provided that it has not yet happened
  • The Kaplan-Meier analysis can be used to estimate survival
    • fitKM <- survival::survfit(Surv(dataSurv$tenure, dataSurv$churn) ~ 1, type = "kaplan-meier")
    • print(fitKM) # gives a few rough summary statistics
    • plot(fitKM) # survival curve with confidence interval
    • fitKMstr <- survfit(Surv(tenure, churn) ~ Partner, data = dataSurv) # add covariates, such as ~ Partner rather than ~1 as in the baseline
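As a sanity check on what survfit() computes, the Kaplan-Meier product-limit estimator can be reproduced by hand on a tiny sample (the times and statuses below are hypothetical, not course data):

```r
library(survival)

# Tiny hypothetical sample: 1 = churned (event), 0 = censored
time   <- c(2, 3, 3, 5, 6, 8)
status <- c(1, 1, 0, 1, 0, 1)

fitKM <- survfit(Surv(time, status) ~ 1)

# Product-limit by hand: at each event time, multiply the running
# survival estimate by (1 - d/n), where d = events at that time and
# n = subjects still at risk
eventTimes <- sort(unique(time[status == 1]))
survByHand <- cumprod(sapply(eventTimes, function(t) {
  1 - sum(time == t & status == 1) / sum(time >= t)
}))

cbind(time = eventTimes, byHand = survByHand, survfit = summary(fitKM)$surv)
```

The two columns agree exactly, which is a quick way to internalize what the plotted step function represents.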

Cox PH model with constant covariates:

  • Model definition: lambda(t | x) = lambda_0(t) * exp(b1*x1 + ... + bp*xp)
    • Predictors enter linearly in the exponent and are therefore multiplicatively related to the baseline hazard function, lambda_0(t)
    • The relative hazard between any two observations needs to remain constant over time (the proportional hazards assumption)
  • Fitting a survival model in R
    • library(rms)
    • units(dataSurv$tenure) <- "Month"
    • dd <- datadist(dataSurv)
    • options(datadist = "dd")
    • fitCPH1 <- cph(Surv(tenure, churn) ~ gender + SeniorCitizen + Partner + Dependents + StreamMov + PaperlessBilling + PayMeth + MonthlyCharges, data = dataSurv, x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
    • Coefficient interpretation is relatively similar to logistic regression - exp(fitCPH1$coefficients) - can simplify the coefficients by exponentiating them into multiplicative hazard ratios (1.00 is no impact)
    • survplot(fitCPH1, MonthlyCharges, label.curves = list(keys = 1:5)) # plots the survival probabilities based on varying 1 variable, assuming other variables constant
    • survplot(fitCPH1, Partner) # covariate with partner, plotted
    • plot(summary(fitCPH1), log = TRUE) # visualizing the hazard ratios
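The rms::cph() calls above need the course's dataSurv. As a self-contained illustration of fitting and interpreting a Cox PH model, here is a hedged sketch using survival::coxph() (the base equivalent of cph()) on simulated data; the covariates `partner` and `charges` are made up:

```r
library(survival)

set.seed(1)
n <- 500
partner <- rbinom(n, 1, 0.5)        # binary covariate (made up)
charges <- rnorm(n, 60, 15)         # continuous covariate (made up)

# Simulate so that charges raise the hazard and partner lowers it
rate   <- 0.05 * exp(0.02 * charges - 0.5 * partner)
time   <- rexp(n, rate)
status <- as.numeric(time <= 24)    # administrative censoring at 24 months
time   <- pmin(time, 24)

fitCox <- coxph(Surv(time, status) ~ partner + charges)

# exp(coef) are hazard ratios: 1.00 = no impact, < 1 lowers the hazard
# (event takes longer), > 1 raises it
exp(coef(fitCox))
```

With this setup the estimated hazard ratio for `partner` lands below 1 and the one for `charges` above 1, matching the simulated effects.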

Checking model assumptions and making predictions:

  • Can again use the Cox PH function
    • testCPH1 <- cox.zph(fitCPH1)
    • print(testCPH1) # if p < 0.05, can reject the assumption that the predictor meets the proportional hazard assumption
    • plot(testCPH1, var = "Partner=Yes")
    • plot(testCPH1, var = "MonthlyCharges")
    • This test is conservative and sensitive to the number of observations
  • If the PH (proportional hazard) assumptions are violated, can correct for this using
    • fitCPH2 <- cph(Surv(tenure, churn) ~ MonthlyCharges + SeniorCitizen + Partner + Dependents + StreamMov + Contract, stratum = "gender = Male", data = dataSurv, x = TRUE, y = TRUE, surv = TRUE)
    • rms::validate(fitCPH1, method = "crossvalidation", B = 10, pr = FALSE) # pr=FALSE means only print at the end; R2 is the R-squared corrected by cross-validation
  • Can then assess probabilities for the event to occur
    • oneNewData <- data.frame(gender = "Female", SeniorCitizen = "Yes", Partner = "No", Dependents = "Yes", StreamMov = "Yes", PaperlessBilling = "Yes", PayMeth = "BankTrans(auto)", MonthlyCharges = 37.12)
    • str(survest(fitCPH1, newdata = oneNewData, times = 3))
    • plot(survfit(fitCPH1, newdata = oneNewData))
    • print(survfit(fitCPH1, newdata = oneNewData))

Example code includes:

survData <- readr::read_delim("./RInputFiles/survivalDataExercise.csv", delim=",")
## Parsed with column specification:
## cols(
##   daysSinceFirstPurch = col_double(),
##   shoppingCartValue = col_double(),
##   gender = col_character(),
##   voucher = col_double(),
##   returned = col_double(),
##   boughtAgain = col_double()
## )
dataNextOrder <- survData %>%
    select(daysSinceFirstPurch, boughtAgain)

# Look at the head of the data
head(dataNextOrder)
## # A tibble: 6 x 2
##   daysSinceFirstPurch boughtAgain
##                 <dbl>       <dbl>
## 1                  37           0
## 2                  63           1
## 3                  48           0
## 4                  17           1
## 5                  53           0
## 6                  11           1
# Plot a histogram
ggplot(dataNextOrder) +
  geom_histogram(aes(x = daysSinceFirstPurch, fill = factor(boughtAgain))) +
  facet_grid( ~ boughtAgain) + # Separate plots for boughtAgain = 1 vs. 0
  theme(legend.position = "none") # Don't show legend
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create survival object
survObj <- survival::Surv(dataNextOrder$daysSinceFirstPurch, dataNextOrder$boughtAgain)

# Look at structure
str(survObj)
##  'Surv' num [1:5122, 1:2]  37+  63   48+  17   53+  11   22   16   74+  44  ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "time" "status"
##  - attr(*, "type")= chr "right"
# Compute and print fit
fitKMSimple <- survival::survfit(survObj ~ 1)
print(fitKMSimple)
## Call: survfit(formula = survObj ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      41      40      42
# Plot fit
plot(fitKMSimple, conf.int = FALSE, xlab = "Time since first purchase", 
     ylab = "Survival function", main = "Survival function"
     )

dataNextOrder <- survData %>%
    select(daysSinceFirstPurch, boughtAgain, voucher)

# Compute fit with categorical covariate
fitKMCov <- survival::survfit(survObj ~ voucher, data = dataNextOrder)

# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3, xlab = "Time since first purchase", 
     ylab = "Survival function", main = "Survival function"
     )
legend(90, .9, c("No", "Yes"), lty = 2:3)

dataNextOrder <- survData

# Determine distributions of predictor variables
dd <- rms::datadist(dataNextOrder)
options(datadist = "dd")

# Compute Cox PH Model and print results
fitCPH <- rms::cph(survival::Surv(daysSinceFirstPurch, boughtAgain) ~ 
                       shoppingCartValue + voucher + returned + gender, data = dataNextOrder, 
                   x = TRUE, y = TRUE, surv = TRUE
                   )
print(fitCPH)
## Cox Proportional Hazards Model
##  
##  rms::cph(formula = survival::Surv(daysSinceFirstPurch, boughtAgain) ~ 
##      shoppingCartValue + voucher + returned + gender, data = dataNextOrder, 
##      x = TRUE, y = TRUE, surv = TRUE)
##  
##                       Model Tests       Discrimination    
##                                            Indexes        
##  Obs       5122    LR chi2    155.68    R2       0.030    
##  Events    3199    d.f.            4    Dxy      0.116    
##  Center -0.2808    Pr(> chi2) 0.0000    g        0.238    
##                    Score chi2 140.57    gr       1.269    
##                    Pr(> chi2) 0.0000                      
##  
##                    Coef    S.E.   Wald Z Pr(>|Z|)
##  shoppingCartValue -0.0021 0.0003 -7.56  <0.0001 
##  voucher           -0.2945 0.0480 -6.14  <0.0001 
##  returned          -0.3145 0.0495 -6.36  <0.0001 
##  gender=male        0.1080 0.0363  2.97  0.0029  
## 
# Interpret coefficients
exp(fitCPH$coefficients)
## shoppingCartValue           voucher          returned       gender=male 
##         0.9978601         0.7449362         0.7301667         1.1140891
# Plot result summary
plot(summary(fitCPH), log = TRUE)

# Check proportional hazard assumption and print result
testCPH <- survival::cox.zph(fitCPH)
print(testCPH)
##                       rho chisq      p
## shoppingCartValue -0.0168 0.907 0.3409
## voucher           -0.0155 0.770 0.3803
## returned           0.0261 2.182 0.1397
## gender=male        0.0390 4.922 0.0265
## GLOBAL                 NA 8.528 0.0740
# Plot time-dependent beta
plot(testCPH, var = "gender=male")

# Validate model
rms::validate(fitCPH, method = "crossvalidation", B = 10, dxy = TRUE, pr = FALSE)
##       index.orig training   test optimism index.corrected  n
## Dxy       0.1159   0.1160 0.1145   0.0014          0.1144 10
## R2        0.0299   0.0300 0.0288   0.0013          0.0287 10
## Slope     1.0000   1.0000 0.9733   0.0267          0.9733 10
## D         0.0032   0.0033 0.0042  -0.0009          0.0041 10
## U         0.0000   0.0000 0.0002  -0.0002          0.0002 10
## Q         0.0032   0.0033 0.0040  -0.0007          0.0040 10
## g         0.2380   0.2382 0.2320   0.0062          0.2318 10
# Create data with new customer
newCustomer <- data.frame(daysSinceFirstPurch = 21, shoppingCartValue = 99.9, gender = "female", 
                          voucher = 1, returned = 0, stringsAsFactors = FALSE
                          )

# Make predictions
pred <- survival::survfit(fitCPH, newdata = newCustomer)
print(pred)
## Call: survfit(formula = fitCPH, newdata = newCustomer)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      47      44      49
plot(pred)

# Correct the customer's gender
newCustomer2 <- newCustomer
newCustomer2$gender <- "male"

# Redo prediction
pred2 <- survival::survfit(fitCPH, newdata = newCustomer2)
print(pred2)
## Call: survfit(formula = fitCPH, newdata = newCustomer2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      44      42      47

Chapter 4 - Reducing Dimensionality with Principal Component Analysis

PCA for CRM Data - address multicollinearity and data volume issues in the raw CRM data:

  • PCA reduces a large number of correlated variables to a smaller number of uncorrelated (orthogonal) variables
  • PCA can also help with creating an index, such as using the first component of the PCA
  • All variables must be either continuous or binary prior to running the PCA analysis
    • dataCustomers %>% cor() %>% corrplot() # plot the initial correlations

PCA Computation:

  • Need to standardize the variables first, otherwise high-variance variables will be over-represented in the PCA
    • lapply(dataCustomers, var)
    • dataCustomers <- dataCustomers %>% scale() %>% as.data.frame()
    • pcaCust <- prcomp(dataCustomers)
    • pcaCust$sdev %>% round(2) # standard deviations by component
    • pcaCust$sdev ^ 2 %>% round(2) # variances, also known as eigenvalues, by component give a good sense for relative importance (relative ratio is percent of variance explained)
    • round(pcaCust$rotation[, 1:6], 2) # correlations between original variables and principal components (can use these to give descriptive names to components)
  • An observation's score on a component is the weighted sum of its (scaled) variable values, with the weights taken from that component's rotation column
    • sum(dataCustomers[1,] * pcaCust$rotation[,1]) # Value on 1st component for 1st customer
    • pcaCust$x[1:5, 1:6] # first 5 customers and first 6 component loadings (weightings)
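The score computation in the last two bullets can be checked on the built-in mtcars data (a stand-in for the course's dataCustomers, which is not available here):

```r
# Standardize first, then run PCA (all mtcars columns are numeric)
dat <- as.data.frame(scale(mtcars))
pca <- prcomp(dat)

# Eigenvalues are the squared standard deviations of the components;
# for standardized data they sum to the number of variables
eigenvalues <- pca$sdev^2

# Score of the first observation on PC1: the weighted sum of its
# standardized values, weights from the first rotation column
scoreByHand <- sum(dat[1, ] * pca$rotation[, 1])

c(byHand = scoreByHand, stored = pca$x[1, 1])  # identical
```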

PCA Model Specification:

  • Need to decide on how many components to keep - balance size of data vs. reconstruction of original data
    • Can set a minimum requirement for percentage of variance explained (such as 70%)
    • summary(pcaCust) # will show cumulatives also
    • Can use the Kaiser-Guttman criterion, which keeps only components with an eigenvalue above 1 (since 1 is the average eigenvalue for standardized data)
    • Can also draw a scree plot to see the variances (eigenvalues) in descending order - look for an elbow
    • screeplot(pcaCust, type = "lines")
    • Generally, use a few different techniques, and pick a number that is “in the range”
  • The biplot can help to show how the data map on to the principal components
    • biplot(pcaCust, choices = 1:2, cex = 0.7) # will show PC1 and PC2, with arrows for the various features and how they map on them
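The selection heuristics above can be sketched on built-in mtcars (the dataset and the 70% cutoff are illustrative, not the course's):

```r
pca <- prcomp(mtcars, scale. = TRUE)
eigenvalues <- pca$sdev^2

# Kaiser-Guttman: keep components with eigenvalue above 1 (the average)
nKaiser <- sum(eigenvalues > 1)

# Minimum cumulative variance explained, e.g. 70%
cumVar <- cumsum(eigenvalues) / sum(eigenvalues)
nVar70 <- which(cumVar >= 0.7)[1]

# Scree plot: look for the elbow in the descending eigenvalues
screeplot(pca, type = "lines")

c(kaiserGuttman = nKaiser, cumVar70 = nVar70)
```

Here the two heuristics happen to agree; when they do not, pick a number "in the range" as the notes suggest.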

Principal components in a regression analysis:

  • PCA can help to solve the multi-collinearity problem in a regression
    • dataCustComponents <- cbind(dataCustomers[, "customerSatis"], pcaCust$x[, 1:6]) %>% as.data.frame
    • mod2 <- lm(customerSatis ~ ., dataCustComponents)
    • vif(mod2) # by construction, these will all be 1, since the principal components are orthogonal
  • Factor analysis is another dimension-reduction technique, sometimes confused with PCA
    • Factor analysis theorizes that latent constructs (e.g., intelligence) which cannot be directly measured are influencing the observed variables
    • Factor analysis is often used in questionnaires - factor analysis can investigate where multiple questions really just measure one thing
    • In contrast, with PCA, the features are actually being combined to model the data
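The "VIF = 1 by construction" claim can be verified without the car package by computing VIF by hand (1 / (1 - R²) of each predictor regressed on the others); mtcars again stands in for the course data, with mpg playing the role of customerSatis:

```r
# PCA on the predictors only, then regress the response on the components
pca <- prcomp(mtcars[, -1], scale. = TRUE)
dat <- data.frame(mpg = mtcars$mpg, pca$x[, 1:3])

mod <- lm(mpg ~ ., data = dat)

# Hand-rolled VIF: regress each component on the other two
vifs <- sapply(c("PC1", "PC2", "PC3"), function(v) {
  others <- setdiff(c("PC1", "PC2", "PC3"), v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})
vifs  # all 1: the components are orthogonal, so no multicollinearity
```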

Wrap up:

  • Logistic regression for churn
  • Survival analysis to prevent churn
  • Principal component analysis (PCA) to reduce multicollinearity

Example code includes:

load("./RInputFiles/newsData.RData")

rawData <- newsData
newsData <- newsData[, c('n_tokens_title', 'n_tokens_content', 'n_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos', 'num_keywords', 'is_weekend', 'kw_avg_min', 'kw_avg_avg', 'kw_avg_max', 'average_token_length', 'global_subjectivity', 'global_sentiment_polarity', 'global_rate_positive_words', 'global_rate_negative_words', 'avg_positive_polarity', 'avg_negative_polarity', 'title_subjectivity', 'title_sentiment_polarity')]


# Overview of data structure:
str(newsData, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    39644 obs. of  21 variables:
##  $ n_tokens_title            : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content          : num  219 255 211 531 1072 ...
##  $ n_unique_tokens           : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ num_hrefs                 : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs            : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                  : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ num_keywords              : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ is_weekend                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ average_token_length      : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ global_subjectivity       : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words: num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words: num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ avg_positive_polarity     : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ avg_negative_polarity     : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ title_subjectivity        : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity  : num  -0.188 0 0 0 0.136 ...
# Correlation structure:
newsData %>% cor() %>% corrplot::corrplot()

# Standardize data
newsData <- newsData %>% scale() %>% as.data.frame()

# Compute PCA
pcaNews <- newsData %>% prcomp()

# Eigenvalues
pcaNews$sdev**2
##  [1] 3.31015107 2.00241491 1.82662819 1.67421238 1.30249854 1.20443028
##  [7] 1.02889482 1.00052438 0.97929267 0.95905061 0.82676492 0.74951891
## [13] 0.73162009 0.66351863 0.62319656 0.57949073 0.47020594 0.41516936
## [19] 0.29926492 0.27690363 0.07624847
# Screeplot:
screeplot(pcaNews, type = "lines")

# Cumulative explained variance:
summary(pcaNews)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.8194 1.41507 1.35153 1.29391 1.14127 1.09747
## Proportion of Variance 0.1576 0.09535 0.08698 0.07972 0.06202 0.05735
## Cumulative Proportion  0.1576 0.25298 0.33996 0.41969 0.48171 0.53906
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.01434 1.00026 0.98959 0.97931 0.90927 0.86575
## Proportion of Variance 0.04899 0.04764 0.04663 0.04567 0.03937 0.03569
## Cumulative Proportion  0.58806 0.63570 0.68234 0.72800 0.76737 0.80307
##                           PC13   PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.85535 0.8146 0.78943 0.76124 0.68572 0.64434
## Proportion of Variance 0.03484 0.0316 0.02968 0.02759 0.02239 0.01977
## Cumulative Proportion  0.83790 0.8695 0.89918 0.92677 0.94916 0.96893
##                           PC19    PC20    PC21
## Standard deviation     0.54705 0.52622 0.27613
## Proportion of Variance 0.01425 0.01319 0.00363
## Cumulative Proportion  0.98318 0.99637 1.00000
# Kaiser-Guttmann (number of components with eigenvalue larger than 1):
sum(pcaNews$sdev > 1)
## [1] 8
# Print loadings of the first six components
pcaNews$rotation[, 1:6] %>% round(2)
##                              PC1   PC2   PC3   PC4   PC5   PC6
## n_tokens_title             -0.05 -0.10  0.01 -0.10  0.20 -0.28
## n_tokens_content            0.23 -0.17 -0.38  0.12  0.15 -0.02
## n_unique_tokens             0.00  0.00  0.00  0.01  0.01  0.06
## num_hrefs                   0.26 -0.16 -0.42 -0.03  0.07  0.11
## num_self_hrefs              0.20 -0.07 -0.39  0.06  0.12  0.08
## num_imgs                    0.14 -0.15 -0.43 -0.06  0.04  0.08
## num_videos                  0.09 -0.20  0.04 -0.19  0.16 -0.14
## num_keywords                0.07  0.11 -0.25  0.14 -0.42 -0.30
## is_weekend                  0.05 -0.01 -0.12 -0.02 -0.10 -0.16
## kw_avg_min                  0.03  0.01 -0.05 -0.25 -0.65  0.07
## kw_avg_avg                  0.02 -0.15 -0.06 -0.61 -0.31  0.17
## kw_avg_max                 -0.10 -0.21  0.10 -0.50  0.35  0.26
## average_token_length        0.39 -0.02  0.19  0.19 -0.01  0.14
## global_subjectivity         0.45 -0.01  0.23 -0.04 -0.03  0.03
## global_sentiment_polarity   0.25  0.55 -0.03 -0.19  0.11  0.13
## global_rate_positive_words  0.33  0.25  0.14 -0.08  0.04 -0.09
## global_rate_negative_words  0.15 -0.47  0.23  0.11 -0.10 -0.21
## avg_positive_polarity       0.42  0.09  0.17 -0.06  0.02  0.10
## avg_negative_polarity      -0.25  0.37 -0.20 -0.04  0.08  0.06
## title_subjectivity          0.07 -0.03  0.01 -0.27  0.07 -0.61
## title_sentiment_polarity    0.07  0.24 -0.11 -0.24  0.15 -0.42
pcaNews %>% biplot(choices=1:2, cex = 0.5)

# Predict log shares with all original variables
logShares <- rawData %>%
    select(shares) %>%
    mutate(logShares=log(1+shares)) %>%
    pull(logShares) %>%
    scale()

newsData <- newsData %>%
    cbind(logShares)

mod1 <- lm(logShares ~ ., data = newsData)

# Create dataframe with log shares and first 6 components
dataNewsComponents <- cbind(logShares = newsData[, "logShares"], pcaNews$x[, 1:6]) %>%
  as.data.frame()

# Predict log shares with first six components
mod2 <- lm(logShares ~ ., data = dataNewsComponents)

# Print adjusted R squared for both models
summary(mod1)$adj.r.squared
## [1] 0.07954578
summary(mod2)$adj.r.squared
## [1] 0.05066316

Interactive Maps with leaflet in R

Chapter 1 - Setting Up Interactive Web Maps

Introduction to leaflet - open-source JavaScript library that makes interactive, mobile-friendly maps:

  • Objective for this course is to build up to an interactive map of 4-year colleges, including incorporation type (public, private, etc.)
    • Additionally, labels that occur when hovering
  • Leaflet builds maps using tiles, which join many smaller maps together
    • library(leaflet)
    • leaflet() %>% addTiles() # zooming and scrolling lead to new tiles being shown
  • In Chapter 1, will use multiple tile types to create maps of the DataCamp HQs in Belgium and New York
    • leaflet() %>% addProviderTiles("CartoDB") %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq)

Map tiles - over 100 pre-canned maps that are available as bases:

  • Selecting a base map - consider the intended purpose of the map, and ensure that the maps selected meet that purpose
    • Instructor has a preference for gray-scale maps (for ease of seeing other data)
  • The base maps are stored as "providers" - most are available for immediate use, but a few require registration
    • names(providers) # get all the available providers
    • names(providers)[str_detect(names(providers), "OpenStreetMap")] # all from OpenStreetMap
    • leaflet() %>% addProviderTiles("OpenStreetMap.BlackAndWhite") # replace the default addTiles() with the BW OpenStreetMap

Setting the default map view:

  • Can load the map centered on a specific point and with a requested zoom level - common to use ggmap::geocode()
    • ggmap::geocode("350 5th Ave, New York, NY 10118") # will return the lat-lon where possible (uses the Google API unless source="dsk" is chosen)
  • Can use either setView() or fitBounds()
    • leaflet() %>% addTiles() %>% setView(lng = -73.98575, lat = 40.74856, zoom = 13) # setView picks a lat/lon and zoom
    • leaflet() %>% addTiles() %>% fitBounds(lng1 = -73.910, lat1 = 40.773, lng2 = -74.060, lat2 = 40.723) # fitBounds defines a rectangle
  • Can limit user controls such as panning and zooming
    • leaflet(options = leafletOptions(dragging = FALSE, minZoom = 14, maxZoom = 18)) %>% addProviderTiles("CartoDB") %>% setView(lng = -73.98575, lat = 40.74856, zoom = 18)
    • dragging = FALSE removes the ability to pan
    • maxZoom and minZoom limit the options for zooming
    • leaflet() %>% addTiles() %>% setView(lng = -73.98575, lat = 40.74856, zoom = 18) %>% setMaxBounds(lng1 = -73.98575 + 0.05, lat1 = 40.74856 + 0.05, lng2 = -73.98575 - 0.05, lat2 = 40.74856 - 0.05)
    • setMaxBounds() limits the user to the boundaries that you pre-specify
  • For more information, see the leaflet package documentation

Plotting DataCamp HQ:

  • Location markers are a common addition, managed using addMarkers()
    • leaflet() %>% addTiles() %>% addMarkers(lng = -73.98575, lat = 40.74856)
    • If single vectors are passed to lng and lat, then a single blue pin will be placed and the map will be centered/zoomed there
    • dc_hq <- tibble(hq = c("DataCamp - NYC", "DataCamp - Belgium"), lon = c(-73.98575, 4.717863), lat = c(40.74856, 50.881363))
    • leaflet() %>% addTiles() %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat)
    • When the tibble is passed, the map will be zoomed/centered so that all the pins can be displayed: dc_hq %>% leaflet() %>% addTiles() %>% addMarkers()
    • The functions will seek a lat and lon column from the piped in data (dc_hq in this case), and pass along a note that they were used
  • Pop-ups are a common way to provide additional information about a marker
    • leaflet() %>% addTiles() %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq) # markers, with popup enabled on clicking
    • leaflet() %>% addTiles() %>% addPopups(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq) # popups instead of markers
  • Leaflet maps can be stored as objects (similar to ggplot2), with further layers added and the object printed later

Example code includes:

# Load the leaflet library
library(leaflet)
## 
## Attaching package: 'leaflet'
## The following object is masked from 'package:xts':
## 
##     addLegend
# Create a leaflet map with default map tile using addTiles()
leaflet() %>%
    addTiles()
# Print the providers list included in the leaflet library
providers
## $OpenStreetMap
## [1] "OpenStreetMap"
## 
## $OpenStreetMap.Mapnik
## [1] "OpenStreetMap.Mapnik"
## 
## $OpenStreetMap.BlackAndWhite
## [1] "OpenStreetMap.BlackAndWhite"
## 
## $OpenStreetMap.DE
## [1] "OpenStreetMap.DE"
## 
## $OpenStreetMap.CH
## [1] "OpenStreetMap.CH"
## 
## $OpenStreetMap.France
## [1] "OpenStreetMap.France"
## 
## $OpenStreetMap.HOT
## [1] "OpenStreetMap.HOT"
## 
## $OpenStreetMap.BZH
## [1] "OpenStreetMap.BZH"
## 
## $OpenInfraMap
## [1] "OpenInfraMap"
## 
## $OpenInfraMap.Power
## [1] "OpenInfraMap.Power"
## 
## $OpenInfraMap.Telecom
## [1] "OpenInfraMap.Telecom"
## 
## $OpenInfraMap.Petroleum
## [1] "OpenInfraMap.Petroleum"
## 
## $OpenInfraMap.Water
## [1] "OpenInfraMap.Water"
## 
## $OpenSeaMap
## [1] "OpenSeaMap"
## 
## $OpenPtMap
## [1] "OpenPtMap"
## 
## $OpenTopoMap
## [1] "OpenTopoMap"
## 
## $OpenRailwayMap
## [1] "OpenRailwayMap"
## 
## $OpenFireMap
## [1] "OpenFireMap"
## 
## $SafeCast
## [1] "SafeCast"
## 
## $Thunderforest
## [1] "Thunderforest"
## 
## $Thunderforest.OpenCycleMap
## [1] "Thunderforest.OpenCycleMap"
## 
## $Thunderforest.Transport
## [1] "Thunderforest.Transport"
## 
## $Thunderforest.TransportDark
## [1] "Thunderforest.TransportDark"
## 
## $Thunderforest.SpinalMap
## [1] "Thunderforest.SpinalMap"
## 
## $Thunderforest.Landscape
## [1] "Thunderforest.Landscape"
## 
## $Thunderforest.Outdoors
## [1] "Thunderforest.Outdoors"
## 
## $Thunderforest.Pioneer
## [1] "Thunderforest.Pioneer"
## 
## $OpenMapSurfer
## [1] "OpenMapSurfer"
## 
## $OpenMapSurfer.Roads
## [1] "OpenMapSurfer.Roads"
## 
## $OpenMapSurfer.AdminBounds
## [1] "OpenMapSurfer.AdminBounds"
## 
## $OpenMapSurfer.Grayscale
## [1] "OpenMapSurfer.Grayscale"
## 
## $Hydda
## [1] "Hydda"
## 
## $Hydda.Full
## [1] "Hydda.Full"
## 
## $Hydda.Base
## [1] "Hydda.Base"
## 
## $Hydda.RoadsAndLabels
## [1] "Hydda.RoadsAndLabels"
## 
## $MapBox
## [1] "MapBox"
## 
## $Stamen
## [1] "Stamen"
## 
## $Stamen.Toner
## [1] "Stamen.Toner"
## 
## $Stamen.TonerBackground
## [1] "Stamen.TonerBackground"
## 
## $Stamen.TonerHybrid
## [1] "Stamen.TonerHybrid"
## 
## $Stamen.TonerLines
## [1] "Stamen.TonerLines"
## 
## $Stamen.TonerLabels
## [1] "Stamen.TonerLabels"
## 
## $Stamen.TonerLite
## [1] "Stamen.TonerLite"
## 
## $Stamen.Watercolor
## [1] "Stamen.Watercolor"
## 
## $Stamen.Terrain
## [1] "Stamen.Terrain"
## 
## $Stamen.TerrainBackground
## [1] "Stamen.TerrainBackground"
## 
## $Stamen.TopOSMRelief
## [1] "Stamen.TopOSMRelief"
## 
## $Stamen.TopOSMFeatures
## [1] "Stamen.TopOSMFeatures"
## 
## $Esri
## [1] "Esri"
## 
## $Esri.WorldStreetMap
## [1] "Esri.WorldStreetMap"
## 
## $Esri.DeLorme
## [1] "Esri.DeLorme"
## 
## $Esri.WorldTopoMap
## [1] "Esri.WorldTopoMap"
## 
## $Esri.WorldImagery
## [1] "Esri.WorldImagery"
## 
## $Esri.WorldTerrain
## [1] "Esri.WorldTerrain"
## 
## $Esri.WorldShadedRelief
## [1] "Esri.WorldShadedRelief"
## 
## $Esri.WorldPhysical
## [1] "Esri.WorldPhysical"
## 
## $Esri.OceanBasemap
## [1] "Esri.OceanBasemap"
## 
## $Esri.NatGeoWorldMap
## [1] "Esri.NatGeoWorldMap"
## 
## $Esri.WorldGrayCanvas
## [1] "Esri.WorldGrayCanvas"
## 
## $OpenWeatherMap
## [1] "OpenWeatherMap"
## 
## $OpenWeatherMap.Clouds
## [1] "OpenWeatherMap.Clouds"
## 
## $OpenWeatherMap.CloudsClassic
## [1] "OpenWeatherMap.CloudsClassic"
## 
## $OpenWeatherMap.Precipitation
## [1] "OpenWeatherMap.Precipitation"
## 
## $OpenWeatherMap.PrecipitationClassic
## [1] "OpenWeatherMap.PrecipitationClassic"
## 
## $OpenWeatherMap.Rain
## [1] "OpenWeatherMap.Rain"
## 
## $OpenWeatherMap.RainClassic
## [1] "OpenWeatherMap.RainClassic"
## 
## $OpenWeatherMap.Pressure
## [1] "OpenWeatherMap.Pressure"
## 
## $OpenWeatherMap.PressureContour
## [1] "OpenWeatherMap.PressureContour"
## 
## $OpenWeatherMap.Wind
## [1] "OpenWeatherMap.Wind"
## 
## $OpenWeatherMap.Temperature
## [1] "OpenWeatherMap.Temperature"
## 
## $OpenWeatherMap.Snow
## [1] "OpenWeatherMap.Snow"
## 
## $HERE
## [1] "HERE"
## 
## $HERE.normalDay
## [1] "HERE.normalDay"
## 
## $HERE.normalDayCustom
## [1] "HERE.normalDayCustom"
## 
## $HERE.normalDayGrey
## [1] "HERE.normalDayGrey"
## 
## $HERE.normalDayMobile
## [1] "HERE.normalDayMobile"
## 
## $HERE.normalDayGreyMobile
## [1] "HERE.normalDayGreyMobile"
## 
## $HERE.normalDayTransit
## [1] "HERE.normalDayTransit"
## 
## $HERE.normalDayTransitMobile
## [1] "HERE.normalDayTransitMobile"
## 
## $HERE.normalNight
## [1] "HERE.normalNight"
## 
## $HERE.normalNightMobile
## [1] "HERE.normalNightMobile"
## 
## $HERE.normalNightGrey
## [1] "HERE.normalNightGrey"
## 
## $HERE.normalNightGreyMobile
## [1] "HERE.normalNightGreyMobile"
## 
## $HERE.basicMap
## [1] "HERE.basicMap"
## 
## $HERE.mapLabels
## [1] "HERE.mapLabels"
## 
## $HERE.trafficFlow
## [1] "HERE.trafficFlow"
## 
## $HERE.carnavDayGrey
## [1] "HERE.carnavDayGrey"
## 
## $HERE.hybridDay
## [1] "HERE.hybridDay"
## 
## $HERE.hybridDayMobile
## [1] "HERE.hybridDayMobile"
## 
## $HERE.pedestrianDay
## [1] "HERE.pedestrianDay"
## 
## $HERE.pedestrianNight
## [1] "HERE.pedestrianNight"
## 
## $HERE.satelliteDay
## [1] "HERE.satelliteDay"
## 
## $HERE.terrainDay
## [1] "HERE.terrainDay"
## 
## $HERE.terrainDayMobile
## [1] "HERE.terrainDayMobile"
## 
## $FreeMapSK
## [1] "FreeMapSK"
## 
## $MtbMap
## [1] "MtbMap"
## 
## $CartoDB
## [1] "CartoDB"
## 
## $CartoDB.Positron
## [1] "CartoDB.Positron"
## 
## $CartoDB.PositronNoLabels
## [1] "CartoDB.PositronNoLabels"
## 
## $CartoDB.PositronOnlyLabels
## [1] "CartoDB.PositronOnlyLabels"
## 
## $CartoDB.DarkMatter
## [1] "CartoDB.DarkMatter"
## 
## $CartoDB.DarkMatterNoLabels
## [1] "CartoDB.DarkMatterNoLabels"
## 
## $CartoDB.DarkMatterOnlyLabels
## [1] "CartoDB.DarkMatterOnlyLabels"
## 
## $HikeBike
## [1] "HikeBike"
## 
## $HikeBike.HikeBike
## [1] "HikeBike.HikeBike"
## 
## $HikeBike.HillShading
## [1] "HikeBike.HillShading"
## 
## $BasemapAT
## [1] "BasemapAT"
## 
## $BasemapAT.basemap
## [1] "BasemapAT.basemap"
## 
## $BasemapAT.grau
## [1] "BasemapAT.grau"
## 
## $BasemapAT.overlay
## [1] "BasemapAT.overlay"
## 
## $BasemapAT.highdpi
## [1] "BasemapAT.highdpi"
## 
## $BasemapAT.orthofoto
## [1] "BasemapAT.orthofoto"
## 
## $nlmaps
## [1] "nlmaps"
## 
## $nlmaps.standaard
## [1] "nlmaps.standaard"
## 
## $nlmaps.pastel
## [1] "nlmaps.pastel"
## 
## $nlmaps.grijs
## [1] "nlmaps.grijs"
## 
## $nlmaps.luchtfoto
## [1] "nlmaps.luchtfoto"
## 
## $NASAGIBS
## [1] "NASAGIBS"
## 
## $NASAGIBS.ModisTerraTrueColorCR
## [1] "NASAGIBS.ModisTerraTrueColorCR"
## 
## $NASAGIBS.ModisTerraBands367CR
## [1] "NASAGIBS.ModisTerraBands367CR"
## 
## $NASAGIBS.ViirsEarthAtNight2012
## [1] "NASAGIBS.ViirsEarthAtNight2012"
## 
## $NASAGIBS.ModisTerraLSTDay
## [1] "NASAGIBS.ModisTerraLSTDay"
## 
## $NASAGIBS.ModisTerraSnowCover
## [1] "NASAGIBS.ModisTerraSnowCover"
## 
## $NASAGIBS.ModisTerraAOD
## [1] "NASAGIBS.ModisTerraAOD"
## 
## $NASAGIBS.ModisTerraChlorophyll
## [1] "NASAGIBS.ModisTerraChlorophyll"
## 
## $NLS
## [1] "NLS"
## 
## $JusticeMap
## [1] "JusticeMap"
## 
## $JusticeMap.income
## [1] "JusticeMap.income"
## 
## $JusticeMap.americanIndian
## [1] "JusticeMap.americanIndian"
## 
## $JusticeMap.asian
## [1] "JusticeMap.asian"
## 
## $JusticeMap.black
## [1] "JusticeMap.black"
## 
## $JusticeMap.hispanic
## [1] "JusticeMap.hispanic"
## 
## $JusticeMap.multi
## [1] "JusticeMap.multi"
## 
## $JusticeMap.nonWhite
## [1] "JusticeMap.nonWhite"
## 
## $JusticeMap.white
## [1] "JusticeMap.white"
## 
## $JusticeMap.plurality
## [1] "JusticeMap.plurality"
## 
## $Wikimedia
## [1] "Wikimedia"
# Print only the names of the map tiles in the providers list 
names(providers)
##   [1] "OpenStreetMap"                      
##   [2] "OpenStreetMap.Mapnik"               
##   [3] "OpenStreetMap.BlackAndWhite"        
##   [4] "OpenStreetMap.DE"                   
##   [5] "OpenStreetMap.CH"                   
##   [6] "OpenStreetMap.France"               
##   [7] "OpenStreetMap.HOT"                  
##   [8] "OpenStreetMap.BZH"                  
##   [9] "OpenInfraMap"                       
##  [10] "OpenInfraMap.Power"                 
##  [11] "OpenInfraMap.Telecom"               
##  [12] "OpenInfraMap.Petroleum"             
##  [13] "OpenInfraMap.Water"                 
##  [14] "OpenSeaMap"                         
##  [15] "OpenPtMap"                          
##  [16] "OpenTopoMap"                        
##  [17] "OpenRailwayMap"                     
##  [18] "OpenFireMap"                        
##  [19] "SafeCast"                           
##  [20] "Thunderforest"                      
##  [21] "Thunderforest.OpenCycleMap"         
##  [22] "Thunderforest.Transport"            
##  [23] "Thunderforest.TransportDark"        
##  [24] "Thunderforest.SpinalMap"            
##  [25] "Thunderforest.Landscape"            
##  [26] "Thunderforest.Outdoors"             
##  [27] "Thunderforest.Pioneer"              
##  [28] "OpenMapSurfer"                      
##  [29] "OpenMapSurfer.Roads"                
##  [30] "OpenMapSurfer.AdminBounds"          
##  [31] "OpenMapSurfer.Grayscale"            
##  [32] "Hydda"                              
##  [33] "Hydda.Full"                         
##  [34] "Hydda.Base"                         
##  [35] "Hydda.RoadsAndLabels"               
##  [36] "MapBox"                             
##  [37] "Stamen"                             
##  [38] "Stamen.Toner"                       
##  [39] "Stamen.TonerBackground"             
##  [40] "Stamen.TonerHybrid"                 
##  [41] "Stamen.TonerLines"                  
##  [42] "Stamen.TonerLabels"                 
##  [43] "Stamen.TonerLite"                   
##  [44] "Stamen.Watercolor"                  
##  [45] "Stamen.Terrain"                     
##  [46] "Stamen.TerrainBackground"           
##  [47] "Stamen.TopOSMRelief"                
##  [48] "Stamen.TopOSMFeatures"              
##  [49] "Esri"                               
##  [50] "Esri.WorldStreetMap"                
##  [51] "Esri.DeLorme"                       
##  [52] "Esri.WorldTopoMap"                  
##  [53] "Esri.WorldImagery"                  
##  [54] "Esri.WorldTerrain"                  
##  [55] "Esri.WorldShadedRelief"             
##  [56] "Esri.WorldPhysical"                 
##  [57] "Esri.OceanBasemap"                  
##  [58] "Esri.NatGeoWorldMap"                
##  [59] "Esri.WorldGrayCanvas"               
##  [60] "OpenWeatherMap"                     
##  [61] "OpenWeatherMap.Clouds"              
##  [62] "OpenWeatherMap.CloudsClassic"       
##  [63] "OpenWeatherMap.Precipitation"       
##  [64] "OpenWeatherMap.PrecipitationClassic"
##  [65] "OpenWeatherMap.Rain"                
##  [66] "OpenWeatherMap.RainClassic"         
##  [67] "OpenWeatherMap.Pressure"            
##  [68] "OpenWeatherMap.PressureContour"     
##  [69] "OpenWeatherMap.Wind"                
##  [70] "OpenWeatherMap.Temperature"         
##  [71] "OpenWeatherMap.Snow"                
##  [72] "HERE"                               
##  [73] "HERE.normalDay"                     
##  [74] "HERE.normalDayCustom"               
##  [75] "HERE.normalDayGrey"                 
##  [76] "HERE.normalDayMobile"               
##  [77] "HERE.normalDayGreyMobile"           
##  [78] "HERE.normalDayTransit"              
##  [79] "HERE.normalDayTransitMobile"        
##  [80] "HERE.normalNight"                   
##  [81] "HERE.normalNightMobile"             
##  [82] "HERE.normalNightGrey"               
##  [83] "HERE.normalNightGreyMobile"         
##  [84] "HERE.basicMap"                      
##  [85] "HERE.mapLabels"                     
##  [86] "HERE.trafficFlow"                   
##  [87] "HERE.carnavDayGrey"                 
##  [88] "HERE.hybridDay"                     
##  [89] "HERE.hybridDayMobile"               
##  [90] "HERE.pedestrianDay"                 
##  [91] "HERE.pedestrianNight"               
##  [92] "HERE.satelliteDay"                  
##  [93] "HERE.terrainDay"                    
##  [94] "HERE.terrainDayMobile"              
##  [95] "FreeMapSK"                          
##  [96] "MtbMap"                             
##  [97] "CartoDB"                            
##  [98] "CartoDB.Positron"                   
##  [99] "CartoDB.PositronNoLabels"           
## [100] "CartoDB.PositronOnlyLabels"         
## [101] "CartoDB.DarkMatter"                 
## [102] "CartoDB.DarkMatterNoLabels"         
## [103] "CartoDB.DarkMatterOnlyLabels"       
## [104] "HikeBike"                           
## [105] "HikeBike.HikeBike"                  
## [106] "HikeBike.HillShading"               
## [107] "BasemapAT"                          
## [108] "BasemapAT.basemap"                  
## [109] "BasemapAT.grau"                     
## [110] "BasemapAT.overlay"                  
## [111] "BasemapAT.highdpi"                  
## [112] "BasemapAT.orthofoto"                
## [113] "nlmaps"                             
## [114] "nlmaps.standaard"                   
## [115] "nlmaps.pastel"                      
## [116] "nlmaps.grijs"                       
## [117] "nlmaps.luchtfoto"                   
## [118] "NASAGIBS"                           
## [119] "NASAGIBS.ModisTerraTrueColorCR"     
## [120] "NASAGIBS.ModisTerraBands367CR"      
## [121] "NASAGIBS.ViirsEarthAtNight2012"     
## [122] "NASAGIBS.ModisTerraLSTDay"          
## [123] "NASAGIBS.ModisTerraSnowCover"       
## [124] "NASAGIBS.ModisTerraAOD"             
## [125] "NASAGIBS.ModisTerraChlorophyll"     
## [126] "NLS"                                
## [127] "JusticeMap"                         
## [128] "JusticeMap.income"                  
## [129] "JusticeMap.americanIndian"          
## [130] "JusticeMap.asian"                   
## [131] "JusticeMap.black"                   
## [132] "JusticeMap.hispanic"                
## [133] "JusticeMap.multi"                   
## [134] "JusticeMap.nonWhite"                
## [135] "JusticeMap.white"                   
## [136] "JusticeMap.plurality"               
## [137] "Wikimedia"
# Use str_detect() to determine if the name of each provider tile contains the string "CartoDB"
str_detect(names(providers), "CartoDB")
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE
# Use str_detect() to print only the provider tile names that include the string "CartoDB"
names(providers)[str_detect(names(providers), "CartoDB")]
## [1] "CartoDB"                      "CartoDB.Positron"            
## [3] "CartoDB.PositronNoLabels"     "CartoDB.PositronOnlyLabels"  
## [5] "CartoDB.DarkMatter"           "CartoDB.DarkMatterNoLabels"  
## [7] "CartoDB.DarkMatterOnlyLabels"
# Change addTiles() to addProviderTiles() and set the provider argument to "CartoDB"
leaflet() %>% 
    addProviderTiles("CartoDB")
# Create a leaflet map that uses the Esri provider tile 
leaflet() %>% 
    addProviderTiles("Esri")
# Create a leaflet map that uses the CartoDB.PositronNoLabels provider tile
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels")
# Map with CartoDB tile centered on DataCamp's NYC office with zoom of 6
leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 6)
dc_hq <- tibble::tibble(hq=c("NYC", "Belgium"), lon=c(-73.98575, 4.71786), lat=c(40.7486, 50.8814))
dc_hq
## # A tibble: 2 x 3
##   hq         lon   lat
##   <chr>    <dbl> <dbl>
## 1 NYC     -74.0   40.7
## 2 Belgium   4.72  50.9
# Map with CartoDB.PositronNoLabels tile centered on DataCamp's Belgium office with zoom of 4
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels") %>% 
    setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 4)
leaflet(options = leafletOptions(
                    # Set minZoom and dragging 
                    minZoom = 12, dragging = TRUE))  %>% 
  addProviderTiles("CartoDB")  %>% 
  # Set default zoom level 
  setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 14) %>% 
  # Set max bounds of map 
  setMaxBounds(lng1 = dc_hq$lon[2] + 0.05, 
               lat1 = dc_hq$lat[2] + .05, 
               lng2 = dc_hq$lon[2] - 0.05, 
               lat2 = dc_hq$lat[2] - .05) 
# Plot DataCamp's NYC HQ
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon[1], lat = dc_hq$lat[1])
# Plot DataCamp's NYC HQ with zoom of 12    
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = -73.98575, lat = 40.74856)  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 12)    
# Plot both DataCamp's NYC and Belgium locations
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon, lat = dc_hq$lat)
# Store leaflet hq map in an object called map
map <- leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    # add hq column of dc_hq as popups
    addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, 
               popup = dc_hq$hq
               )

# Center the view of map on the Belgium HQ with a zoom of 5  
map_zoom <- map %>% 
      setView(lat = 50.881363, lng = 4.717863, zoom = 5)

# Print map_zoom
map_zoom

Chapter 2 - Plotting points

Introduction to IPEDS Data:

  • Can clear the boundaries of a map, while keeping everything else (data and the like) constant
    • m %>% clearBounds() # removes the bounds layer
    • m %>% clearBounds() %>% clearMarkers() # removes the marker layers as well
  • The IPEDS data is the Integrated Post-Secondary Education dataset - this course uses a subset consisting of 4-year colleges
    • Goal is to create a subset of the IPEDS data consisting of the ~300 colleges in California
    • Can then plot and color-code the California colleges

Mapping California colleges:

  • Clustered markers are poorly shown by pins due to obscuring
  • A nice alternative is to use circle markers, which have much less tendency for overlaps
    • maine_colleges_map %>% clearMarkers() %>% addCircleMarkers(data = maine_colleges, radius = 3)
    • maine_colleges_map %>% addCircleMarkers(data = maine_colleges, radius = 4, color = "red", popup = ~name) # custom color and radius while maintaining popups

Labels and pop-ups:

  • Can use piping as well as the tilde, which allows for referring to key variables in the piped in data
    • ipeds %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(lng = ~lng, lat = ~lat, popup = ~name, color = "#FF0000")
    • Colors can be specified using hexadecimal, as shown in the example above - can find these using google and color sliders
  • Can build better popups using pipes and tildes
    • addCircleMarkers(popup = ~paste0(name, " - ", sector_label))
    • addCircleMarkers(popup = ~paste0("<b>", name, "</b>", "<br/>", sector_label)) # enhanced with html tags
  • Labels provide similar information as pop-ups, but require only a hover rather than a click
    • ipeds %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(label = ~name, radius = 2)

Color coding colleges:

  • Can include differential colors depending on a variable that has been piped in using colorFactor()
    • OR <- ipeds %>% filter(state == "OR")
    • pal <- colorFactor(palette = c("red", "blue", "#9b4a11"), levels = c("Public", "Private", "For-Profit")) # create the color palette for future use
    • oregon_colleges <- OR %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(radius = 2, color = ~pal(sector_label), label = ~name) # apply as pal()
    • oregon_colleges %>% addLegend(position = "bottomright", pal = pal, values = c("Public", "Private", "For-Profit")) # add to the legend
  • Can instead color based on a numeric value using colorNumeric()
    • admit <- admit %>% filter(!is.na(rate), rate < 50, rate > 0) # filter for rates that exist and are between 0 and 50
    • pal <- colorNumeric(palette = "Reds", domain = c(1:50), reverse = TRUE) # reverse=TRUE flips the gradient so that lower admit rates are darker red
    • admit_map <- admit %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(radius = 4, color = ~pal(rate), label = ~name) %>% addLegend(title = "Admit Rate", pal = pal, values = c(1:50), position = "bottomright")
  • Can use RColorBrewer for default color palettes
    • library(RColorBrewer)
    • display.brewer.all()
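The colorNumeric() steps above can be consolidated into one runnable sketch. The admit data frame here is a hypothetical stand-in (it is not loaded elsewhere in this document); the palette and marker calls mirror the bullets:

```r
library(leaflet)
library(dplyr)

# Hypothetical admit data: one row per college, with an admit rate in percent
admit <- tibble::tibble(
    name = c("College A", "College B", "College C"),
    lng  = c(-73.99, -118.24, -87.63),
    lat  = c(40.75, 34.05, 41.88),
    rate = c(12, 35, 48)
)

# reverse = TRUE so that lower (more selective) rates map to darker reds
pal_rate <- colorNumeric(palette = "Reds", domain = 1:50, reverse = TRUE)

admit %>%
    leaflet() %>%
    addProviderTiles("CartoDB") %>%
    addCircleMarkers(radius = 4, color = ~pal_rate(rate), label = ~name) %>%
    addLegend(title = "Admit Rate", pal = pal_rate, values = 1:50,
              position = "bottomright")
```

pal_rate() returns a hex color string for each rate, which is why it can be passed to color via the tilde notation.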

Example code includes:

# Remove markers, reset bounds, and store the updated map in the m object
map <- map %>%
    clearMarkers() %>% 
    clearBounds()

# Print the cleared map
map
ipedsRaw <- readr::read_csv("./RInputFiles/ipeds.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   lng = col_double(),
##   lat = col_double(),
##   state = col_character(),
##   sector_label = col_character()
## )
# Remove colleges with missing sector information
ipeds <- 
    ipedsRaw %>% 
    tidyr::drop_na()

# Count the number of four-year colleges in each state
ipeds %>% 
    group_by(state)  %>% 
    count()
## # A tibble: 56 x 2
## # Groups:   state [56]
##    state     n
##    <chr> <int>
##  1 AK        6
##  2 AL       45
##  3 AR       26
##  4 AS        1
##  5 AZ       50
##  6 CA      272
##  7 CO       53
##  8 CT       33
##  9 DC       18
## 10 DE        7
## # ... with 46 more rows
# Create a list of US States in descending order by the number of colleges in each state
ipeds  %>% 
    group_by(state)  %>% 
    count()  %>% 
    arrange(desc(n))
## # A tibble: 56 x 2
## # Groups:   state [56]
##    state     n
##    <chr> <int>
##  1 CA      272
##  2 NY      239
##  3 PA      164
##  4 FL      159
##  5 TX      154
##  6 OH      135
##  7 IL      119
##  8 MA      103
##  9 MO       87
## 10 MN       82
## # ... with 46 more rows
# Create a dataframe called `ca` with data on only colleges in California
ca <- ipeds %>%
    filter(state == "CA")

map <- leaflet() %>% 
    addProviderTiles("CartoDB")

# Use `addMarkers` to plot all of the colleges in `ca` on the `m` leaflet map
map %>%
    addMarkers(lng = ca$lng, lat = ca$lat)
la_coords <- data.frame(lat = 34.05223, lon = -118.2437) 

# Center the map on LA 
map %>% 
    addMarkers(data = ca) %>% 
    setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 12)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Set the zoom level to 8 and store in the m object
map_zoom <-
    map %>%
    addMarkers(data = ca) %>%
    setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 8)
## Assuming "lng" and "lat" are longitude and latitude, respectively
map_zoom
# Clear the markers from the map 
map2 <- map %>% clearMarkers()

# Use addCircleMarkers() to plot each college as a circle
map2 %>%
    addCircleMarkers(lng = ca$lng, lat = ca$lat)
# Change the radius of each circle to be 2 pixels and the color to red
map2 %>% 
    addCircleMarkers(lng = ca$lng, lat = ca$lat, radius = 2, color = "red")
# Add circle markers with popups for college names
map %>%
    addCircleMarkers(data = ca, radius = 2, popup = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Change circle color to #2cb42c and store map in map_color object
map_color <- map %>% 
    addCircleMarkers(data = ca, radius = 2, color = "#2cb42c", popup = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Print map_color
map_color
# Clear the bounds and markers on the map object and store in map2
map2 <- map %>% 
    clearBounds() %>% 
    clearMarkers()

# Add circle markers with popups that display both the institution name and sector
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0(name, "<br/>", sector_label)
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Make the institution name in each popup bold
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0("<b>", name, "</b>", "<br/>", sector_label)
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Add circle markers with labels identifying the name of each college
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Use paste0 to add sector information to the label inside parentheses 
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~paste0(name, " (", sector_label, ")"))
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Make a color palette called pal for the values of `sector_label` using `colorFactor()`  
# Colors should be: "red", "blue", and "#9b4a11" for "Public", "Private", and "For-Profit" colleges, respectively
pal <- colorFactor(palette = c("red", "blue", "#9b4a11"), 
                   levels = c("Public", "Private", "For-Profit")
                   )

# Add circle markers that color colleges using pal() and the values of sector_label
map2 <- map %>% 
        addCircleMarkers(data = ca, radius = 2, 
                         color = ~pal(sector_label), 
                         label = ~paste0(name, " (", sector_label, ")")
                         )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Print map2
map2
# Add a legend that displays the colors used in pal
map2 %>% 
    addLegend(pal = pal, values = c("Public", "Private", "For-Profit"))
# Customize the legend
map2 %>% 
    addLegend(pal = pal, 
              values = c("Public", "Private", "For-Profit"),
              # opacity of .5, title of Sector, and position of topright
              opacity = 0.5, title = "Sector", position = "topright"
              )

Chapter 3 - Groups, Layers, Extras

Leaflet Extras Package:

  • The leaflet.extras package provides some nice extensibility to the baseline leaflet package
    • leaflet() %>% addTiles() %>% addSearchOSM() # searching open-source-maps (magnifying glass icon with search box)
    • leaflet() %>% addTiles() %>% addSearchOSM() %>% addReverseSearchOSM() # can also use geocode to find a click, as requested by addReverseSearchOSM()
    • leaflet() %>% addTiles() %>% addSearchOSM() %>% addReverseSearchOSM() %>% addResetMapButton() # can click "reset" to return to the default view

Overlay Groups - ability to control the segments that are displayed on the map:

  • One option is to segment the data in advance, then to add as layers using addCircleMarkers
    • ca_public <- ipeds %>% filter(sector_label == "Public", state == "CA")
    • m %>% addCircleMarkers(data = ca_public, group = "Public")
  • After creating multiple calls for addCircleMarkers(), each with group=, can then activate the grouping
    • addLayersControl(overlayGroups = c("Public", "Private", "For-Profit"))
  • Since the layers are stacked, the order in which they are added matters (they layer/stack on top of each other)

Base Groups - can provide multiple options for toggling (only one may be selected at a time):

  • Need to call addProviderTiles() once for each layer that is an option, then activate using addLayersControl()
    • a <- leaflet() %>% addTiles(group = “OSM”) %>% addProviderTiles(“CartoDB”, group = “Carto”) %>% addProviderTiles(“Esri”, group = “Esri”)
    • a %>% addLayersControl(baseGroups = c(“OSM”, “Carto”, “Esri”), position = “topleft”)
  • Can be handy to try a few different base groups during exploratory analysis, to find the base that best matches the rest of the analysis
  • Basic four-step process for building up the base groups includes
    • leaflet() %>% # initialize leaflet map
    • addTiles(group = "OSM") %>% addProviderTiles("CartoDB", group = "Carto") %>% addProviderTiles("Esri", group = "Esri") %>% # add basemaps with groups
    • addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>% addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Private") %>% addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "For-Profit") %>% # add marker layer for each sector with corresponding group name
    • addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), overlayGroups = c("Public", "Private", "For-Profit")) # add layer controls for base and overlay groups

Pieces of Flair:

  • Can customize the search function using leaflet.extras capabilities
    • ca_public <- ipeds %>% filter(sector_label == "Public", state == "CA")
    • ca_public %>% leaflet() %>% addProviderTiles("Esri") %>% addCircleMarkers(radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>% addSearchFeatures(targetGroups = "Public", options = searchFeaturesOptions(zoom = 10)) # will filter the search on Public data, with a specified zoom
  • Can cluster the colleges to improve readability of the maps
    • ipeds %>% leaflet() %>% addTiles() %>% addCircleMarkers(radius = 2, color = ~pal(sector_label), clusterOptions = markerClusterOptions()) # many colleges in one circle

Example code includes:

library(leaflet.extras)
library(htmltools)

leaflet() %>%
  addTiles() %>% 
  addSearchOSM() %>% 
  addReverseSearchOSM() 
m2 <- ipeds %>% 
    leaflet() %>% 
    # use the CartoDB provider tile
    addProviderTiles("CartoDB") %>% 
    # center on the middle of the US with zoom of 3
    setView(lat = 39.8282, lng = -98.5795, zoom=3)

# Map all American colleges 
m2 %>% 
    addCircleMarkers() 
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Create data frame called public with only public colleges
public <- filter(ipeds, sector_label == "Public")  

# Create a leaflet map of public colleges called m3 
m3 <- leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Public"
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
m3
# Create data frame called private with only private colleges
private <- filter(ipeds, sector_label == "Private")  

# Add private colleges to `m3` as a new layer
m3 <- m3 %>% 
    addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Private"
                     ) %>% 
    addLayersControl(overlayGroups = c("Public", "Private"))
## Assuming "lng" and "lat" are longitude and latitude, respectively
m3
# Create data frame called profit with only for-profit colleges
profit <- filter(ipeds, sector_label == "For-Profit")  

# Add for-profit colleges to `m3` as a new layer
m3 <- m3 %>% 
    addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label),   group = "For-Profit"
                     )  %>% 
    addLayersControl(overlayGroups = c("Public", "Private", "For-Profit"))  
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Center the map on the middle of the US with a zoom of 4
m4 <- m3 %>%
    setView(lat = 39.8282, lng = -98.5795, zoom = 4) 
        
m4
leaflet() %>% 
  # Add the OSM, CartoDB and Esri tiles
  addTiles(group = "OSM") %>% 
  addProviderTiles("CartoDB", group = "Carto") %>% 
  addProviderTiles("Esri", group = "Esri") %>% 
  # Use addLayersControl to allow users to toggle between basemaps
  addLayersControl(baseGroups = c("OSM", "Carto", "Esri"))
m4 <- leaflet() %>% 
    addTiles(group = "OSM") %>% 
    addProviderTiles("CartoDB", group = "Carto") %>% 
    addProviderTiles("Esri", group = "Esri") %>% 
    addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label),  group = "Public"
                     ) %>% 
    addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Private"
                     )  %>% 
    addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "For-Profit"
                     )  %>% 
    addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), 
                     overlayGroups = c("Public", "Private", "For-Profit")
                     ) %>% 
    setView(lat = 39.8282, lng = -98.5795, zoom = 4) 
## Assuming "lng" and "lat" are longitude and latitude, respectively
## Assuming "lng" and "lat" are longitude and latitude, respectively
## Assuming "lng" and "lat" are longitude and latitude, respectively
m4
ipeds %>% 
    leaflet() %>% 
    addTiles() %>% 
    # Sanitize any html in our labels
    addCircleMarkers(radius = 2, label = ~htmlEscape(name), 
                     # Color code colleges by sector using the `pal` color palette 
                     color = ~pal(sector_label), 
                     # Cluster all colleges using `clusterOptions` 
                     clusterOptions = markerClusterOptions()
                     ) 
## Assuming "lng" and "lat" are longitude and latitude, respectively

Chapter 4 - Plotting Polygons

Spatial Data - ability to plot polygons rather than points:

  • Polygons have many points, and are stored as a SpatialPolygonsDataFrame (SPDF) with 5 slots
    • data - one observation per polygon
    • polygons - coordinates to plot each polygon
    • plotOrder - order for plotting
    • bbox - rectangle containing all the polygons
    • proj4string - coordinate reference system (CRS)
    • All accessed using the @ symbol
  • Can join from the data component of the SPDF, accessed using @
    • shp@data <- shp@data %>% left_join(nc_income, by = c("GEOID10" = "zipcode"))
    • shp@polygons[[1]] %>% leaflet() %>% addPolygons() # can plot a single polygon

Mapping Polygons - can pipe SPDF in to a series of leaflet calls:

  • The basic polygon plotting method using leaflet() may produce shape boundaries that are too thick
    • shp %>% leaflet() %>% addTiles() %>% addPolygons()
    • weight - thickness of lines
    • color - color of lines
    • label - information shown on hover
    • highlight - options to highlight polygon on hover
  • The refined plotting approach adds customization for better readability
    • shp %>% leaflet() %>% addTiles() %>% addPolygons(weight = 1, color = "grey", label = ~paste0("Total Income: ", dollar(income)), highlight = highlightOptions(weight = 3, color = "red", bringToFront = TRUE))
  • Can color numeric data when plotting polygons
    • colorNumeric - maps continuous data to interpolated palettes
    • colorBin - colors based on cut function
    • colorQuantile - colors based on quantile
    • nc_pal <- colorNumeric(palette = "Blues", domain = high_inc@data$mean_income)
    • nc_pal <- colorBin(palette = "YlGn", bins = 5, domain = high_inc@data$mean_income)
    • nc_pal <- colorQuantile(palette = "YlGn", n = 4, domain = high_inc@data$mean_income)
  • Example of coloring using colorNumeric()
    • nc_pal <- colorNumeric("Blues", domain = high_inc@data$mean_income)
    • previewColors(pal = nc_pal, values = c(seq(100000, 600000, by = 100000))) # explore sample values
    • shp %>% leaflet() %>% addTiles() %>% addPolygons(weight = 1, fillOpacity = 1, color = ~nc_pal(mean_income), label = ~paste0("Mean Income: ", dollar(mean_income)), highlight = highlightOptions(weight = 3, color = "red", bringToFront = TRUE))
  • Sometimes need to log-transform skewed data for better displays
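
A minimal sketch (using a hypothetical toy income vector, not the course data) of how the three palette constructors behave differently on the same values:

```r
library(leaflet)

# hypothetical toy vector standing in for mean incomes, with one extreme value
incomes <- c(30000, 45000, 60000, 120000, 550000)

# continuous interpolation across the full range
pal_num <- colorNumeric(palette = "Blues", domain = incomes)

# 5 equal-width bins (based on cut())
pal_bin <- colorBin(palette = "YlGn", domain = incomes, bins = 5)

# quartile-based bins, so each bin holds roughly the same number of values
pal_qnt <- colorQuantile(palette = "YlGn", domain = incomes, n = 4)

pal_num(incomes)
pal_bin(incomes)
pal_qnt(incomes)
```

With a skewed vector like this one, colorNumeric() and colorBin() crowd most values into the low end of the palette, while colorQuantile() spreads them evenly - which is also why log-transforming skewed data can improve the display.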

Putting Everything Together:

  • Leaflet and htmlwidgets for base maps and coloring
  • Base and overlay groups to enhance interactivity
  • Features available in the leaflet.extras package
  • Can piece together a full map that includes both polygons and circle markers
    • leaflet() %>% addTiles(group = "OSM") %>% addProviderTiles("CartoDB", group = "Carto") %>%
    • addProviderTiles("Esri", group = "Esri") %>%
    • addPolygons(data = shp, weight = 1, fillOpacity = .75, color = ~nc_pal(log(mean_income)), label = ~paste0("Mean Income: ", dollar(mean_income)), group = "Mean Income") %>%
    • addCircleMarkers(data = nc_public, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>%
    • addCircleMarkers(data = nc_private, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Private") %>%
    • addCircleMarkers(data = nc_profit, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "For-Profit") %>%
    • addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), overlayGroups = c("Public", "Private", "For-Profit", "Mean Income"))
  • Can also save a map for future use
    • m <- leaflet() %>% addTiles() %>% addMarkers(data = ipeds, clusterOptions = markerClusterOptions()) %>% addPolygons(data = shp)
    • library(htmlwidgets)
    • saveWidget(m, file = "myMap.html") # saves the map as a standalone HTML file

Wrap up - additional resources:

Example code includes:

load("./RInputFiles/nc_zips.Rda")
load("./RInputFiles/wealthiest_zips.Rda")
nc_income <- readr::read_csv("./RInputFiles/mean_income_by_zip_nc.csv")
## Parsed with column specification:
## cols(
##   zipcode = col_double(),
##   returns = col_double(),
##   income = col_double(),
##   mean_income = col_double()
## )
str(nc_income, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 723 obs. of  4 variables:
##  $ zipcode    : num  28207 28211 27608 28480 27517 ...
##  $ returns    : num  4470 14060 5690 1510 12710 ...
##  $ income     : num  2.46e+09 3.32e+09 1.13e+09 2.41e+08 1.97e+09 ...
##  $ mean_income: num  550849 235961 197725 159617 154682 ...
# Print a summary of the `shp` data
summary(shp)
## Loading required package: sp
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -84.32187 -75.46089
## y  33.84232  36.58812
## Is projected: FALSE 
## proj4string :
## [+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84
## +towgs84=0,0,0]
## Data attributes:
##     GEOID10         ALAND10   
##  27006  :  1   100240769:  1  
##  27007  :  1   100252722:  1  
##  27009  :  1   1003885  :  1  
##  27011  :  1   100620829:  1  
##  27012  :  1   100707703:  1  
##  27013  :  1   101001856:  1  
##  (Other):802   (Other)  :802
# Print the class of `shp`
class(shp)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# Print the slot names of `shp`
slotNames(shp)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
# Glimpse the data slot of shp
glimpse(shp@data)
## Observations: 808
## Variables: 2
## $ GEOID10 <fct> 27925, 28754, 28092, 27217, 28711, 28666, 28602, 27841...
## $ ALAND10 <fct> 624688620, 223734670, 317180853, 318965510, 258603117,...
# Print the class of the data slot of shp
class(shp@data)
## [1] "data.frame"
# Print GEOID10
shp@data$GEOID10
##   [1] 27925 28754 28092 27217 28711 28666 28602 27841 27831 28785 27504
##  [12] 27330 28768 28658 28716 28139 27565 28394 27982 28025 28159 28382
##  [23] 28312 28342 27839 27852 28723 28077 28039 28452 27306 28375 28713
##  [34] 28743 28717 28150 28447 27205 27379 28425 27827 27540 28114 28451
##  [45] 27892 27249 28628 27873 28781 27916 28705 28714 28101 28102 28445
##  [56] 28448 28458 28719 28478 28479 28501 28748 28752 28207 28753 28757
##  [67] 28209 28212 28560 28504 27983 27985 28018 28019 28562 28906 28530
##  [78] 28771 28779 28782 28376 28581 28152 28169 28170 28657 28021 28204
##  [89] 28533 28540 28543 28551 28262 28280 28575 28790 28792 28667 28672
## [100] 28108 28462 28681 28465 28734 28739 28694 28697 28702 28745 28127
## [111] 28420 28422 28424 28428 28435 28088 28089 28090 27562 28334 28787
## [122] 28433 27360 27534 28043 27370 28444 27531 28675 28712 28449 27053
## [133] 27944 28367 28326 28740 28659 28282 27244 27597 27017 28761 28457
## [144] 28441 27956 27889 28652 28146 28513 28777 28786 27596 27530 28369
## [155] 28327 27340 27028 27823 27879 28244 27810 27886 28306 27025 27239
## [166] 27967 27824 27826 27834 27030 28358 28365 27520 27524 27525 27526
## [177] 27292 27874 27882 27883 27885 27253 27576 27577 27582 27295 27298
## [188] 27332 27910 27052 27055 27344 27516 27850 27856 27265 27603 27605
## [199] 27537 27539 27541 28601 28604 27809 27278 27284 27371 27201 27312
## [210] 28320 28325 27207 28330 28607 28611 28612 27549 27555 27317 27320
## [221] 27703 27709 28350 28643 28337 28621 27569 28645 28651 27948 28630
## [232] 27923 27929 27936 27943 28721 28512 27546 27891 28379 27822 27909
## [243] 28655 28662 27587 27589 28625 28742 28553 27941 28134 27043 27893
## [254] 28328 28135 28007 28338 27110 28472 28756 28110 28519 27861 27407
## [265] 28374 28211 28668 27214 27965 27949 27806 28340 27917 27288 27563
## [276] 28669 27229 27283 27109 27843 27047 28303 28585 28676 28689 28305
## [287] 28635 28640 27016 27863 27968 28528 27915 27981 28411 28577 27326
## [298] 27954 28556 27105 27545 27813 27974 27301 28168 28670 28801 27050
## [309] 28610 28665 28125 28538 27849 28036 28586 27801 27807 28904 27875
## [320] 28557 27958 28468 27536 28213 28341 28747 28707 27262 28006 28360
## [331] 28031 27845 28166 28616 27572 27014 27503 27011 28572 28386 27291
## [342] 28432 27804 27343 28073 28467 28173 28539 28352 27828 28515 28555
## [353] 27855 27583 28310 28396 28348 28138 28642 27542 27408 28215 27821
## [364] 28105 28270 28206 28301 27876 28627 27019 28574 28647 28806 27349
## [375] 28091 28660 28726 28508 27840 28803 28511 27964 27978 28086 27927
## [386] 28774 28383 27559 28523 28332 28749 27962 27455 28056 27501 28027
## [397] 27527 27282 27837 28682 27310 28356 27233 27231 27006 28144 27857
## [408] 27042 28314 27612 28525 27281 28147 28366 28629 27523 27937 28119
## [419] 28012 27048 27880 27350 27027 27606 27938 28638 28720 28580 27103
## [430] 27986 28001 28034 28393 28032 28040 28677 28395 28391 28678 28399
## [441] 28455 28098 28401 28103 28684 28685 28409 28071 28683 28083 28708
## [452] 28097 28450 28431 28453 28454 28709 28439 28377 28715 28443 28436
## [463] 28438 28751 28129 28133 28763 28109 28120 28466 28746 28137 28480
## [474] 28759 28731 28762 28405 28054 28698 28081 28403 28052 28701 28690
## [485] 28412 28704 28078 28421 28693 28544 28516 28773 28775 28905 28174
## [496] 28203 28570 28208 28210 28202 28804 28805 28791 28901 28547 28107
## [507] 28722 28729 28461 28730 28463 28552 28554 28115 28732 28112 28214
## [518] 28733 28308 28304 28571 28584 28582 28583 28273 28587 28278 28578
## [529] 28579 28323 28164 28605 28518 28520 28526 28783 28529 28167 28521
## [540] 28531 28311 28163 28537 28772 28626 27942 27928 28634 28649 28339
## [551] 28357 27935 28623 28618 28654 28624 28619 27922 28307 28226 27946
## [562] 27947 28347 28349 28227 28637 27926 27920 28646 28573 27921 28351
## [573] 28269 28590 27341 28364 27604 27976 28615 27357 28344 28613 28609
## [584] 28343 27409 27376 27377 27701 27610 27979 27405 27704 27705 27959
## [595] 27960 27403 27966 27953 27970 27972 27973 27707 27957 27401 27517
## [606] 27502 27507 27508 27509 27510 27518 27505 27020 27613 27024 27514
## [617] 27519 27713 27614 27803 27616 27617 27513 27511 27023 27046 27844
## [628] 27869 27853 27051 27041 27521 27871 27872 27842 27106 27830 27846
## [639] 27013 27862 27104 27832 27847 27858 27865 27851 27825 27829 27012
## [650] 27816 27817 27557 27808 27209 27208 27820 27888 27814 27551 27556
## [661] 27045 27235 27560 27215 27054 27248 27242 27260 27243 27258 27581
## [672] 27812 27601 27592 27591 27544 27316 27313 27325 27314 27311 27896
## [683] 27007 28650 28606 27009 28735 28673 28725 28033 27870 27864 28429
## [694] 28384 28663 27022 28333 27574 28524 28527 28277 27263 28023 27573
## [705] 27615 28020 28464 28128 28009 28205 28104 27299 27884 28076 28080
## [716] 28160 28532 27302 28124 27932 27924 28037 27819 27608 28789 28079
## [727] 28398 27553 27878 27018 27040 28392 27315 28594 27950 28442 27410
## [738] 27805 28371 27305 28778 28692 28072 28456 28589 28363 27355 27358
## [749] 28385 28736 27890 27522 28617 28671 28387 28390 27212 27609 27568
## [760] 28679 27881 27101 28622 28644 28631 28636 28373 28345 27712 28117
## [771] 27866 27021 27406 28741 28372 27897 28430 27980 28017 27203 28909
## [782] 27127 27607 27939 28217 28216 27252 28423 28718 27919 28510 28460
## [793] 28434 28470 28766 28546 27818 27529 28469 28016 28075 28318 27107
## [804] 27356 28315 27571 27860 28902
## 33144 Levels: 00601 00602 00603 00606 00610 00612 00616 00617 00622 ... 99929
shp@data$GEOID10 <- as.integer(as.character(shp@data$GEOID10))
str(shp@data$GEOID10)
##  int [1:808] 27925 28754 28092 27217 28711 28666 28602 27841 27831 28785 ...
# Glimpse the nc_income data
glimpse(nc_income)
## Observations: 723
## Variables: 4
## $ zipcode     <dbl> 28207, 28211, 27608, 28480, 27517, 27614, 28173, 2...
## $ returns     <dbl> 4470, 14060, 5690, 1510, 12710, 15670, 21880, 7640...
## $ income      <dbl> 2462295000, 3317607000, 1125055000, 241022000, 196...
## $ mean_income <dbl> 550849.0, 235960.7, 197725.0, 159617.2, 154682.2, ...
# Summarise the nc_income data
summary(nc_income)
##     zipcode         returns          income           mean_income    
##  Min.   :27006   Min.   :  110   Min.   :4.557e+06   Min.   : 26625  
##  1st Qu.:27605   1st Qu.: 1105   1st Qu.:4.615e+07   1st Qu.: 40368  
##  Median :28115   Median : 3050   Median :1.526e+08   Median : 46288  
##  Mean   :28062   Mean   : 5979   Mean   :3.648e+08   Mean   : 53338  
##  3rd Qu.:28521   3rd Qu.: 9050   3rd Qu.:4.670e+08   3rd Qu.: 55917  
##  Max.   :28909   Max.   :37020   Max.   :3.970e+09   Max.   :550849
# Left join nc_income onto shp@data and store in shp_nc_income
shp_nc_income <- shp@data %>% 
                left_join(nc_income, by = c("GEOID10" = "zipcode"))

# Print the number of missing values of each variable in shp_nc_income
shp_nc_income %>%
  summarise_all(funs(sum(is.na(.))))
##   GEOID10 ALAND10 returns income mean_income
## 1       0       0      85     85          85
shp <- merge(shp, shp_nc_income, by=c("GEOID10", "ALAND10"))


# map the polygons in shp
shp %>% 
    leaflet() %>% 
    addTiles() %>% 
    addPolygons()
# which zips were not in the income data?
shp_na <- shp[is.na(shp$mean_income),]

# map the polygons in shp_na
shp_na %>% 
    leaflet() %>% 
    addTiles() %>% 
    addPolygons()
# summarise the mean income variable
summary(shp$mean_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   26625   40368   46288   53338   55917  550849      85
# subset shp to include only zip codes in the top quartile of mean income
high_inc <- shp[!is.na(shp$mean_income) & shp$mean_income > 55917,]

# map the boundaries of the zip codes in the top quartile of mean income
high_inc %>%
  leaflet() %>%
  addTiles() %>%
  addPolygons()
dollar <- function (x, negative_parens=TRUE, prefix="$", suffix="") {
    # KLUGE to make this work . . . 
    needs_cents <- function(...) { FALSE }
    if (length(x) == 0) 
        return(character())
    x <- plyr::round_any(x, 0.01)
    if (needs_cents(x, largest_with_cents)) {
        nsmall <- 2L
    }
    else {
        x <- plyr::round_any(x, 1)
        nsmall <- 0L
    }
    negative <- !is.na(x) & x < 0
    if (negative_parens) {
        x <- abs(x)
    }
    amount <- format(abs(x), nsmall = nsmall, trim = TRUE, big.mark = ",", scientific = FALSE, digits = 1L)
    if (negative_parens) {
        paste0(ifelse(negative, "(", ""), prefix, amount, suffix, ifelse(negative, ")", ""))
    }
    else {
        paste0(prefix, ifelse(negative, "-", ""), amount, suffix)
    }
}


# create color palette with colorNumeric()
nc_pal <- colorNumeric("YlGn", domain = high_inc@data$mean_income)

high_inc %>%
  leaflet() %>%
  addTiles() %>%
  # set boundary thickness to 1 and color polygons blue
  addPolygons(weight = 1, color = ~nc_pal(mean_income),
              # add labels that display mean income
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              # highlight polygons on hover
              highlight = highlightOptions(weight = 5, color = "white",
              bringToFront = TRUE))
# Create a logged version of the nc_pal color palette
nc_pal <- colorNumeric("YlGn", domain = log(high_inc@data$mean_income))

# apply the nc_pal
high_inc %>%
  leaflet() %>%
  addProviderTiles("CartoDB") %>%
  addPolygons(weight = 1, color = ~nc_pal(log(mean_income)), fillOpacity = 1,
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlightOptions = highlightOptions(weight = 5, color = "white", bringToFront = TRUE))
# Print the slot names of `wealthy_zips`
slotNames(wealthy_zips)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
# Print a summary of the `mean_income` variable
summary(wealthy_zips$mean_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  200444  229914  279330  339859  371904 2553591
# plot zip codes with mean incomes >= $200k
wealthy_zips %>% 
  leaflet() %>% 
  addProviderTiles("CartoDB") %>% 
  addPolygons(weight = 1, fillOpacity = .7, color = "Green",  group = "Wealthy Zipcodes", 
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlightOptions = highlightOptions(weight = 5, color = "white", bringToFront = TRUE))
# Add polygons using wealthy_zips
final_map <- m4 %>% 
   addPolygons(data = wealthy_zips, weight = 1, fillOpacity = .5, color = "Grey",  group = "Wealthy Zip Codes", 
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlight = highlightOptions(weight = 5, color = "white", bringToFront = TRUE)) %>% 
    # Update layer controls including "Wealthy Zip Codes"
    addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), 
                         overlayGroups = c("Public", "Private", "For-Profit", "Wealthy Zip Codes"))

# Print and explore your very last map of the course!
final_map

Support Vector Machines in R

Chapter 1 - Introduction

Sugar content of soft drinks:

  • Course covers Support Vector Machines (SVM), including visualization, mechanics, situations where they work best, etc.
    • Will stick with binary classification for this course
  • For a 1-dimensional dataset, the clusters can be separated by choosing a “separating boundary” (decision boundary)
  • Margins are the distances between the decision boundary and the closest point
    • The best decision boundary is considered to be the decision boundary that maximizes the margin (more robust to noise)
    • The SVM tries to find the decision boundary that maximizes the margin in n-dimensions

Generating a linearly separable dataset

  • Can use runif() to generate random data that is uniformly distributed between 0 and 1
    • n <- 200
    • set.seed(42)
    • df <- data.frame(x1 = runif(n), x2 = runif(n))
  • Can define the points with x1 < x2 as class A and the points with x1 > x2 as class B
    • Can also create a margin by filtering out points where abs(x1-x2) is below a user-specified threshold

Example code includes:

df <- data.frame(sample=1:25, 
                 sugar_content=c(10.9, 10.9, 10.6, 10, 8, 8.2, 8.6, 10.9, 10.7, 8, 7.7, 7.8, 8.4, 11.5, 11.2, 8.9, 8.7, 7.4, 10.9, 10, 11.4, 10.8, 8.5, 8.2, 10.6)
                 )
str(df)
## 'data.frame':    25 obs. of  2 variables:
##  $ sample       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sugar_content: num  10.9 10.9 10.6 10 8 8.2 8.6 10.9 10.7 8 ...
#print variable names
names(df)
## [1] "sample"        "sugar_content"
#build plot
plot_ <- ggplot(data = df, aes(x = sugar_content, y = c(0))) + 
    geom_point() + 
    geom_text(label = df$sugar_content, size = 2.5, vjust = 2, hjust = 0.5)

#display plot
plot_

#The maximal margin separator lies at the midpoint between the closest points of the two clusters (8.9 and 10).
mm_separator <- (8.9 + 10)/2


#create data frame
separator <- data.frame(sep = c(mm_separator))

#add ggplot layer 
plot_ <- plot_ + 
  geom_point(data = separator, x = separator$sep, y = c(0), color = "blue", size = 4)

#display plot
plot_

#set seed
set.seed(42)

#set number of data points. 
n <- 600

#Generate data frame with two uniformly distributed predictors lying between 0 and 1.
df <- data.frame(x1 = runif(n), x2 = runif(n))

#classify data points depending on location
df$y <- factor(ifelse(df$x2 - 1.4*df$x1 < 0, -1, 1), levels = c(-1, 1))


#set margin
delta <- 0.07

# retain only those points that lie outside the margin
df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ]

#build plot
plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + 
    scale_color_manual(values = c("red", "blue")) + 
    geom_abline(slope = 1.4, intercept = 0)+
    geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") +
    geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
 
#display plot 
plot_margins


Chapter 2 - Support Vector Classifiers - Linear Kernels

Linear Support Vector Machines:

  • Can split the data from the previous chapter (perfectly separable) in to train/test on an 80-20 basis
    • set.seed(1)
    • df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
    • trainset <- df[df$train == 1, ]
    • testset <- df[df$train == 0, ]
    • trainColNum <- grep("train", names(trainset))
    • trainset <- trainset[, -trainColNum]
    • testset <- testset[, -trainColNum]
  • Decision boundaries can take many shapes (called kernels), such as lines, polynomials, etc.
  • For this chapter, will use e1071::svm(), a function with many options
    • formula, data, type ("C-classification" for classification), kernel ("linear" for this chapter), cost/gamma (tuning parameters, which will be left at the defaults for now), scale (boolean telling whether to scale the data in advance - FALSE makes for easier plotting, but typically would be set to TRUE in the real world)
  • Example of running e1071::svm()
    • library(e1071)
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)
    • svm_model
    • svm_model$index # indices of the support vectors
    • svm_model$SV # support vector coordinates
    • svm_model$rho # negative y-intercept of the decision boundary
    • svm_model$coefs # weighting coefficients of support vectors (magnitude is importance, side is which part of boundary)
    • pred_train <- predict(svm_model,trainset)
    • pred_test <- predict(svm_model,testset)

Visualizing Linear SVM:

  • Can begin by plotting the training data, distinguished by color
    • p <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    • df_sv <- trainset[svm_model$index, ]
    • p <- p + geom_point(data = df_sv, aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
    • p
  • The support vectors tend to be close to the decision boundary - in fact, they are defined as points that “support” the boundary
  • Goal is to extract the slope and intercept of the decision boundary from the model (these are not stored directly in the model object)
    • w <- t(svm_model$coefs) %*% svm_model$SV
    • slope_1 <- -w[1]/w[2]
    • intercept_1 <- svm_model$rho/w[2]
    • p <- p + geom_abline(slope = slope_1, intercept = intercept_1)
    • p <- p + geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed") + geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
    • p
  • There are several properties observed in the plot
    • The boundary is supported by the support vectors
    • The boundary is “soft”, which allows for uncertainty in location/shape of the boundary
    • Can also use e1071::plot(x=myModel, data=myData) to plot the function

Tuning Linear SVM:

  • Can tweak the cost parameter to change the size of the soft boundary for the SVM
    • Higher costs lead to harder (narrower) margins, with fewer support vectors
    • The implication is that raising the cost can be a good idea if the data are known to be linearly separable
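
A minimal sketch of raising the cost, assuming the trainset data frame built in the earlier train/test split (the cost value of 100 is an arbitrary illustration; the default is 1):

```r
library(e1071)

# same linear SVM as before, but with a much higher cost
svm_hard <- svm(y ~ ., data = trainset,
                type = "C-classification", kernel = "linear",
                cost = 100, scale = FALSE)

# a harder margin typically retains fewer support vectors
svm_hard$tot.nSV
```

Comparing svm_hard$tot.nSV against the default-cost model's count is a quick way to see the margin tightening.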

Multi-class problems:

  • SVM can manage classification problems with 3+ target types also - using the example iris data
    • p <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) + geom_point()
    • p
  • The SVM at core is a binary classifier, but can be used in a multi-class setting
    • Have a model for each of the choose(m, 2) possible pairs of classes, and use majority voting on the outputs (ties broken at random)
    • This method is called "one against one" classification, and it is handled automatically by e1071
    • svm_model <- svm(Species ~ ., data = trainset, type = "C-classification", kernel = "linear") # all run automatically
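
A minimal multi-class sketch using the built-in iris data with a hypothetical 80/20 split (split proportion and seed are arbitrary choices):

```r
library(e1071)

# hypothetical 80/20 train/test split of iris
set.seed(1)
train_idx <- sample(nrow(iris), 0.8 * nrow(iris))

# one-against-one classification runs automatically for a 3-level factor
svm_multi <- svm(Species ~ ., data = iris[train_idx, ],
                 type = "C-classification", kernel = "linear")

# confusion matrix on the held-out 20%
table(predicted = predict(svm_multi, iris[-train_idx, ]),
      actual = iris$Species[-train_idx])
```

Behind the scenes, e1071 fits choose(3, 2) = 3 binary classifiers and combines their votes into the single predicted factor.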

Example code includes:

dfOld <- df
delta <- 0.07
df <- df[abs(1.4*df$x1 - df$x2) > delta, ]


#split train and test data in an 80/20 proportion
df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)

#assign training rows to data frame trainset
trainset <- df[df$train == 1, ]
#assign test rows to data frame testset
testset <- df[df$train == 0, ]

#find index of "train" column
trainColNum <- grep("train", names(df))

#remove "train" column from train and test dataset
trainset <- trainset[, -trainColNum]
testset <- testset[, -trainColNum]


library(e1071)

#build svm model, setting required parameters
svm_model<- svm(y ~ ., 
                data = trainset, 
                type = "C-classification", 
                kernel = "linear", 
                scale = FALSE)


#list components of model
names(svm_model)
##  [1] "call"            "type"            "kernel"         
##  [4] "cost"            "degree"          "gamma"          
##  [7] "coef0"           "nu"              "epsilon"        
## [10] "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"         
## [16] "tot.nSV"         "nSV"             "labels"         
## [19] "SV"              "index"           "rho"            
## [22] "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"      
## [28] "fitted"          "decision.values" "terms"
#list values of the SV, index and rho
svm_model$SV
##               x1          x2
## 11  0.4577417762 0.476919189
## 19  0.4749970816 0.486642912
## 45  0.4317512489 0.520339758
## 58  0.1712643304 0.100229354
## 61  0.6756072745 0.772399305
## 69  0.6932048204 0.838569788
## 99  0.7439746463 0.912029979
## 101 0.6262453445 0.765520479
## 103 0.2165673110 0.202548483
## 118 0.3556659538 0.298152283
## 143 0.4640695513 0.535269056
## 144 0.7793681615 0.941694443
## 147 0.1701624813 0.050030747
## 173 0.4140496817 0.380267640
## 176 0.1364903601 0.011009041
## 180 0.7690324257 0.951921815
## 194 0.1290892835 0.021196302
## 199 0.7431877197 0.824081728
## 204 0.4427962683 0.532290264
## 209 0.2524584394 0.281511990
## 226 0.8205145481 0.962842692
## 253 0.2697161783 0.288755647
## 268 0.2050496121 0.182046106
## 272 0.7853494422 0.870432480
## 278 0.4037828147 0.476424339
## 286 0.1709963905 0.164468810
## 294 0.3864540118 0.370921416
## 295 0.3324459905 0.382318948
## 325 0.5648222226 0.618285144
## 338 0.3169501573 0.333509587
## 341 0.4091320913 0.496387038
## 344 0.3597852497 0.345139100
## 393 0.6568108753 0.815567016
## 400 0.0755990995 0.007417523
## 406 0.1079870730 0.022227321
## 413 0.2401496081 0.151690785
## 427 0.4664852461 0.464965629
## 443 0.3626018071 0.369346223
## 450 0.0619409799 0.011438249
## 466 0.6399842701 0.695480783
## 479 0.1730011790 0.136427131
## 503 0.5195604505 0.627322678
## 525 0.6494539515 0.833293378
## 526 0.6903516576 0.790328991
## 535 0.4243346907 0.470753220
## 590 0.7148487861 0.902375512
## 595 0.8058112133 0.937903824
## 600 0.4587231132 0.446819442
## 15  0.4622928225 0.839631285
## 29  0.4469696281 0.721333573
## 37  0.0073341469 0.108096598
## 38  0.2076589728 0.519075874
## 59  0.2610879638 0.472588875
## 90  0.3052183695 0.548420829
## 92  0.0002388966 0.122946701
## 102 0.2171576982 0.505044580
## 104 0.3889450287 0.717138722
## 129 0.2335235255 0.439058027
## 132 0.6034740848 0.958318281
## 133 0.6315072989 0.970767964
## 158 0.0290858189 0.148069276
## 175 0.4274944656 0.725024226
## 178 0.5923042425 0.900228734
## 189 0.1333296183 0.390023998
## 196 0.0531294835 0.276241161
## 202 0.5171110556 0.899924811
## 210 0.2596899802 0.503687580
## 215 0.4513108502 0.743930877
## 229 0.0483467767 0.218475638
## 232 0.1590223818 0.402696270
## 237 0.0865806018 0.263718613
## 239 0.5545858634 0.935806216
## 249 0.4992728804 0.812805236
## 258 0.5397982858 0.932383237
## 276 0.3367135401 0.672058288
## 293 0.3152607968 0.625878707
## 309 0.3199476011 0.541676977
## 311 0.1078112544 0.374908455
## 378 0.1084886545 0.376079086
## 409 0.0842775232 0.235715229
## 419 0.4264662997 0.798480970
## 420 0.0600483362 0.298929408
## 430 0.5141573721 0.908452330
## 451 0.4309255683 0.821331850
## 477 0.5964720468 0.913432184
## 481 0.2329343846 0.409654615
## 482 0.5770482090 0.969947845
## 488 0.2485451805 0.533491509
## 520 0.5784583788 0.907620618
## 524 0.1270027745 0.348539336
## 530 0.2665205784 0.458110426
## 540 0.2131855546 0.530223881
## 558 0.2770604359 0.510976796
## 562 0.2056735931 0.433566746
## 580 0.5705413527 0.994652604
## 581 0.2458533479 0.494881822
svm_model$index
##  [1]   8  11  30  39  42  48  71  73  75  85 103 104 105 124 127 131 141
## [18] 145 149 153 167 188 196 199 204 208 215 216 240 251 254 257 290 297
## [35] 300 306 317 329 335 347 353 372 390 391 398 440 443 445   9  18  23
## [52]  24  40  63  65  74  76  92  95  96 114 126 129 137 143 148 154 157
## [69] 170 173 178 180 186 192 202 214 228 229 281 303 310 311 318 336 351
## [86] 355 356 362 385 389 395 402 415 419 431 432
svm_model$rho
## [1] -0.1641859
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 1
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + 
    scale_color_manual(values = c("red", "blue"))
 
#add plot layer marking out the support vectors 
layered_plot <- 
    scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)

#display plot
layered_plot

#calculate slope and intercept of decision boundary from weight vector and svm model
w <- c(x1=6.55241, x2=-4.73278)  # calculated manually outside of this module
slope_1 <- -w[1]/w[2]
intercept_1 <- svm_model$rho/w[2]

#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
#add decision boundary
plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) 
#add margin boundaries
plot_margins <- plot_decision + 
 geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+
 geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
#display plot
plot_margins

#build svm model
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "linear", scale = FALSE)

#plot decision boundaries and support vectors
plot(x = svm_model, data = trainset)

#build svm model, cost = 1
svm_model_1 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 1,
                   kernel = "linear",
                   scale = FALSE)

#print model details
svm_model_1
## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 1, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  96
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 100,
                   kernel = "linear",
                   scale = FALSE)

#print model details
svm_model_100
## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 100, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  100 
## 
## Number of Support Vectors:  6
# Create the base train_plot
train_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
w_1 <- c(x1=6.55241, x2=-4.73278)  # calculated manually outside of this module
w_100 <- c(x1=18.3097, x2=-13.09972)  # calculated manually outside of this module
intercept_1 <- -0.005515526  # calculated outside of this module
intercept_100 <- 0.001852543  # calculated outside of this module
slope_1 <- -w_1[1]/w_1[2]
slope_100 <- -w_100[1]/w_100[2]


#add decision boundary and margins for cost = 1 to training data scatter plot
train_plot_with_margins <- train_plot + 
    geom_abline(slope = slope_1, intercept = intercept_1) +
    geom_abline(slope = slope_1, intercept = intercept_1 - 1/w_1[2], linetype = "dashed")+
    geom_abline(slope = slope_1, intercept = intercept_1 + 1/w_1[2], linetype = "dashed")

#display plot
train_plot_with_margins

#add decision boundary and margins for cost = 100 to training data scatter plot
train_plot_with_margins <- train_plot_with_margins + 
    geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") +
    geom_abline(slope = slope_100, intercept = intercept_100 - 1/w_100[2], linetype = "dashed", color = "goldenrod")+
    geom_abline(slope = slope_100, intercept = intercept_100 + 1/w_100[2], linetype = "dashed", color = "goldenrod")

#display plot 
train_plot_with_margins

svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)

#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 1
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#plot
plot(svm_model, trainset)

data(iris)
nTrials <- 100
accuracy <- numeric(nTrials)

#calculate accuracy for n distinct 80/20 train/test partitions
for (i in 1:nTrials){ 
    iris[, "train"] <- ifelse(runif(nrow(iris))<0.8, 1, 0)
    trainColNum <- grep("train", names(iris))
    trainset <- iris[iris$train == 1, -trainColNum]
    testset <- iris[iris$train == 0, -trainColNum]
    svm_model <- svm(Species~ ., data = trainset, 
                     type = "C-classification", kernel = "linear")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$Species)
}

#mean accuracy and standard deviation
mean(accuracy) 
## [1] 0.9643194
sd(accuracy)
## [1] 0.03704363

Chapter 3 - Polynomial Kernels

Generating radially separable datasets:

  • The goal is to generate 2D points (again uniformly distributed on x1 and x2 using runif)
  • Can then define a class label based on whether each point lies within a given radius of the center
    • radius <- 0.7
    • radius_squared <- radius^2
    • df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1,1))
    • p <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue"))
    • p
  • Can add a circular boundary
    • circle <- function(x1_center, x2_center, r, npoint = 100){
    • #angular spacing of 2*pi/npoint between points
    • theta <- seq(0,2*pi,length.out = npoint)
    • x1_circ <- x1_center + r * cos(theta)
    • x2_circ <- x2_center + r * sin(theta)
    • return(data.frame(x1c = x1_circ, x2c = x2_circ))
    • }
    • boundary <- circle(x1_center = 0, x2_center = 0, r = radius)
    • p <- p + geom_path(data = boundary, aes(x = x1c, y = x2c), inherit.aes = FALSE)

Linear SVM on radially separable datasets:

  • The linear SVM will perform poorly on the radially separable dataset
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
    • svm_model
    • pred_test <- predict(svm_model,testset)
    • plot(svm_model,trainset) # all points are classified as 1

Kernel trick - devise a mathematical transformation that makes the data linearly separable:

  • For a circle, one could map X1 = x1^2 and X2 = x2^2, so that the boundary X1 + X2 = 0.49 is linearly separable
  • The polynomial kernel has a degree (e.g., 1 for linear, 2 for quadratic, etc.) and tuning parameters gamma and coef0
    • The kernel also uses u dot v where u and v are vectors belonging to the dataset
    • (gamma * (u dot v) + coef0) ** degree
  • Applying the quadratic kernel to the circular data from above
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
    • plot(svm_model, trainset)
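The kernel formula above is easy to verify by hand; a minimal sketch (gamma = 1, coef0 = 0, and degree = 2 are illustrative values for this sketch, not e1071's defaults):

```r
#sketch of the polynomial kernel: (gamma * (u . v) + coef0)^degree
poly_kernel <- function(u, v, gamma = 1, coef0 = 0, degree = 2) {
    (gamma * sum(u * v) + coef0)^degree
}

u <- c(0.5, 0.2)
v <- c(0.1, 0.4)
poly_kernel(u, v)  # (0.5*0.1 + 0.2*0.4)^2 = 0.13^2 = 0.0169
```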

Tuning SVM:

  • Set a search range for each parameter, typically as a sequence of values (e.g., in multiples of 10)
  • For each combination of parameters, build an SVM and assess the out-of-sample accuracy - can become computationally intensive, though
    • tune_out <- tune.svm(x = trainset[,-3], y = trainset[,3], type = "C-classification", kernel = "polynomial", degree = 2, cost = 10^(-1:2), gamma = c(0.1,1,10), coef0 = c(0.1,1,10))
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • tune_out$best.parameters$coef0
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2, cost = tune_out$best.parameters$cost, gamma = tune_out$best.parameters$gamma, coef0 = tune_out$best.parameters$coef0)

Example code includes:

#set number of variables and seed
n <- 400
set.seed(1)

#Generate data frame with two uniformly distributed predictors, x1 and x2
df <- data.frame(x1 = runif(n, min = -1, max = 1), x2 = runif(n, min = -1, max = 1))

#We want a circular boundary. Set boundary radius 
radius <- 0.8
radius_squared <- radius^2

#create dependent categorical variable, y, with value -1 or 1 depending on whether point lies
#within or outside the circle.
df$y <- factor(ifelse(df$x1**2 + df$x2**2 < radius_squared, -1, 1), levels = c(-1, 1))


#build scatter plot, distinguish class by color
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
    geom_point() +
    scale_color_manual(values = c("red", "blue"))

#display plot
scatter_plot

inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]


#default cost model
svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear")

#training accuracy
pred_train <- predict(svm_model_1, trainset)
mean(pred_train == trainset$y)
## [1] 0.64
#test accuracy
pred_test <- predict(svm_model_1, testset)
mean(pred_test == testset$y)
## [1] 0.48
#cost = 100 model
svm_model_100 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear")

#accuracy
pred_train <- predict(svm_model_100, trainset)
mean(pred_train == trainset$y)
## [1] 0.64
pred_test <- predict(svm_model_100, testset)
mean(pred_test == testset$y)
## [1] 0.48
#print average accuracy and standard deviation
accuracy <- rep(NA, 100)
set.seed(2)

#calculate accuracy for 100 distinct 80/20 train/test partitions
for (i in 1:100){
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
    trainset <- df[df$train == 1, ]
    testset <- df[df$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.5554571
sd(accuracy)
## [1] 0.04243524
#transform data
df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y)

#plot data points in the transformed space
plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + 
    geom_point()+ guides(color = FALSE) + 
    scale_color_manual(values = c("red", "blue"))

#add decision boundary and visualize
plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
plot_decision

# Still want to use the old (non-squared) data
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
df$train <- NULL
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]

svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)

#measure training and test accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9866667
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.98
#plot
plot(svm_model, trainset)

#tune model
tune_out <- 
    tune.svm(x = trainset[, -3], y = trainset[, 3], 
             type = "C-classification", 
             kernel = "polynomial", degree = 2, cost = 10^(-1:2), 
             gamma = c(0.1, 1, 10), coef0 = c(0.1, 1, 10))

#list optimal values
tune_out$best.parameters$cost
## [1] 0.1
tune_out$best.parameters$gamma
## [1] 10
tune_out$best.parameters$coef0
## [1] 0.1
#Build tuned model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", 
                 kernel = "polynomial", degree = 2, 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma, 
                 coef0 = tune_out$best.parameters$coef0)

#Calculate training and test accuracies
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9966667
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#plot model
plot(svm_model, trainset)


Chapter 4 - Radial Basis Kernel Functions

Generating complex datasets:

  • The RBF kernel is highly flexible, can fit complex boundaries, and is commonly used in real-world applications
  • Can generate complex data by using different distributions for x and y
    • n <- 600
    • set.seed(42)
    • df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))
  • The decision boundary can then be two circles that just barely touch at the origin
    • radius <- 0.7
    • radius_squared <- radius^2
    • center_1 <- c(-0.7,0)
    • center_2 <- c(0.7,0)
    • df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared | (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1), levels = c(-1,1))
    • p <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + guides(color = FALSE) + scale_color_manual(values = c("red","blue"))
    • p
  • Can then build linear, polynomial, and RBF kernels to model the data

Motivating the RBF kernel:

  • Neither the linear kernel nor the polynomial kernel will work well for the dataset as described
  • Can use the heuristic that points near each other probably belong to the same class (similar to kNN)
    • The kernel should have a maximum at (a, b), and should decay as you move away from (a, b)
    • The rate of decay, all else equal, should be the same in all directions, with a tunable parameter gamma
    • As good fortune has it, the exponential exp(-gamma * r) has all of these properties
    • rbf <- function(r, gamma) exp(-gamma*r)
    • ggplot(data.frame(r = c(0, 10)), aes(r)) +
    • stat_function(fun = rbf, args = list(gamma = 0.2), aes(color = "0.2")) +
    • stat_function(fun = rbf, args = list(gamma = 0.4), aes(color = "0.4")) +
    • stat_function(fun = rbf, args = list(gamma = 0.6), aes(color = "0.6")) +
    • stat_function(fun = rbf, args = list(gamma = 0.8), aes(color = "0.8")) +
    • stat_function(fun = rbf, args = list(gamma = 1), aes(color = "1")) +
    • stat_function(fun = rbf, args = list(gamma = 2), aes(color = "2")) +
    • scale_color_manual("gamma", values = c("red", "orange", "yellow", "green", "blue", "violet")) +
    • ggtitle("Radial basis function (gamma=0.2 to 2)")
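Applied to a pair of points, the same decay idea becomes the RBF kernel, with r taken as the squared Euclidean distance between the points; a minimal sketch:

```r
#sketch of the RBF kernel: exp(-gamma * ||u - v||^2)
rbf_kernel <- function(u, v, gamma = 1) {
    exp(-gamma * sum((u - v)^2))
}

rbf_kernel(c(0, 0), c(0, 0))               # identical points -> 1
rbf_kernel(c(0, 0), c(3, 4), gamma = 0.1)  # squared distance 25 -> exp(-2.5)
```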

The RBF kernel simulates some of the principles of kNN using exponential decay:

  • The RBF kernel can be built using pre-set R commands
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
  • The predicted decision boundary will no longer be linear, and can be refined through tuning
    • tune_out <- tune.svm(x = trainset[,-3], y = trainset[,3], gamma = 5*10^(-2:2), cost = c(0.01,0.1,1,10,100), type = "C-classification", kernel = "radial")
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial", cost = tune_out$best.parameters$cost, gamma = tune_out$best.parameters$gamma)

Example code includes:

#number of data points
n <- 1000

#set seed
set.seed(1)

#create dataframe
df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))


#set radius and centers
radius <- 0.8
center_1 <- c(-0.8, 0)
center_2 <- c(0.8, 0)
radius_squared <- radius^2

#create binary classification variable
df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared |
                      (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1),
                      levels = c(-1, 1))


#create scatter plot
scatter_plot<- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
    geom_point() + 
    scale_color_manual(values = c("red", "blue"))
 
scatter_plot 

# Create 75/25 split for train/test
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]


#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")

#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.5853333
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.564
#plot model against testset
plot(svm_model, testset)

#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)

#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.8253333
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.788
#plot model
plot(svm_model, trainset)

#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)


# Create a dummy frame dfDum for use in the for loop
dfDum <- df

#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    dfDum[, "train"] <- ifelse(runif(nrow(dfDum))<0.8, 1, 0)
    trainset <- dfDum[dfDum$train == 1, ]
    testset <- dfDum[dfDum$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.804765
sd(accuracy)
## [1] 0.02398396
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)

#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    dfDum[, "train"] <- ifelse(runif(nrow(dfDum))<0.8, 1, 0)
    trainset <- dfDum[dfDum$train == 1, ]
    testset <- dfDum[dfDum$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.9034203
sd(accuracy)
## [1] 0.01786378
# Re-create original 75/25 split for train/test
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]

#tune model
tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], 
                     gamma = 5*10^(-2:2), 
                     cost = c(0.01, 0.1, 1, 10, 100), 
                     type = "C-classification", kernel = "radial")
tune_out
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##      5    1
## 
## - best performance: 0.04
#build tuned model
svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma)

#calculate test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.956
#Plot decision boundary against test data
plot(svm_model, testset)


Experimental Design in R

Chapter 1 - Introduction to Experimental Design

Introduction to experimental design:

  • Experiments start with a question in mind, followed by finding and analyzing the relevant data
  • This course will use open data, meaning that we do not know the original experimental design
  • Key conditions of an experiment include randomization, replication, and blocking

Hypothesis testing:

  • The null hypothesis changes depending on the question of interest - “no effect” (two-sided), “no positive effect” (one-sided), etc.
  • Power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true (target >= 80%)
  • The effect size is the standardized measure of the difference that you are trying to detect
  • Sample size is generally chosen so that the effect size can be measured at the required power
  • Example of using the pwr package for calculating these metrics
    • library(pwr)
    • pwr.anova.test(k = 3, n = 20, f = 0.2, sig.level = 0.05, power = NULL) # exactly one argument must be NULL (it will be calculated); k groups, n per group, f effect size
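The same call can be turned around to solve for the per-group sample size at a target power; a minimal sketch, assuming the pwr package is installed (the k, f, and sig.level values are illustrative):

```r
#solve for per-group sample size n at 80% power (k groups, effect size f)
library(pwr)
n_needed <- pwr.anova.test(k = 3, n = NULL, f = 0.2, sig.level = 0.05, power = 0.80)
n_needed
```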

Example code includes:

# load the ToothGrowth dataset
data("ToothGrowth")

#perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333
#perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)

#tidy the t-test model object
broom::tidy(ToothGrowth_ttest)
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1     3.70      20.7      17.0      1.92  0.0606      55.3   -0.171
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>
#group by supp, dose, then examine how many observations in ToothGrowth there are by those groups
ToothGrowth %>% 
    group_by(supp, dose) %>% 
    summarize(n=n())
## # A tibble: 6 x 3
## # Groups:   supp [2]
##   supp   dose     n
##   <fct> <dbl> <int>
## 1 OJ      0.5    10
## 2 OJ      1      10
## 3 OJ      2      10
## 4 VC      0.5    10
## 5 VC      1      10
## 6 VC      2      10
#create a boxplot with geom_boxplot()
ggplot(ToothGrowth, aes(x=as.factor(dose), y=len)) + 
    geom_boxplot()

#create the ToothGrowth_aov model object
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth)

#examine the model object with summary()
summary(ToothGrowth_aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         1 2224.3  2224.3  123.99 6.31e-16 ***
## supp         1  205.3   205.3   11.45   0.0013 ** 
## Residuals   57 1022.6    17.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#less than
t.test(x = ToothGrowth$len, alternative = "less", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
##      -Inf 20.46358
## sample estimates:
## mean of x 
##  18.81333
#greater than
t.test(x = ToothGrowth$len, alternative = "greater", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
##  17.16309      Inf
## sample estimates:
## mean of x 
##  18.81333
#calculate power
pwr::pwr.t.test(n = 100, d = 0.35, sig.level = 0.10, type = "two.sample", 
                alternative = "two.sided", power = NULL
                )
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.35
##       sig.level = 0.1
##           power = 0.7943532
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#calculate sample size
pwr::pwr.t.test(n = NULL, d = 0.25, sig.level = 0.05, 
                type = "one.sample", alternative = "greater", power = 0.8
                )
## 
##      One-sample t test power calculation 
## 
##               n = 100.2877
##               d = 0.25
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater

Chapter 2 - Basic Experiments

Single and Multiple Factor Experiments:

  • The ANOVA (Analysis of Variance) test allows for comparing means across three or more groups - is at least one mean different?
    • model_1 <- lm(y ~ x, data = dataset) # first option: lm() followed by anova()
    • anova(model_1)
    • aov(y ~ x, data = dataset) # second option: a direct call to aov()
  • The multiple factor experiment includes additional potential explanatory variables
    • model2 <- lm(y ~ x + r + s + t, data = dataset)
    • anova(model2)
  • The full Lending Club dataset is roughly 890k rows by 75 columns and contains data from a lending company (the examples below use a 1,500-row, 12-column subset)
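The two routes above are equivalent; a minimal sketch using the built-in ToothGrowth data as a stand-in for the Lending Club dataset:

```r
#route 1: fit with lm(), then call anova()
data("ToothGrowth")
model_1 <- lm(len ~ supp, data = ToothGrowth)
anova(model_1)

#route 2: call aov() directly, then summary() - same F statistic and p-value
model_2 <- aov(len ~ supp, data = ToothGrowth)
summary(model_2)
```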

Model Validation:

  • EDA is an important step prior to modeling the data
  • Boxplots can be a helpful way to explore the data
    • ggplot(data = lendingclub, aes(x = verification_status, y = funded_amnt)) + geom_boxplot()
  • ANOVA and other linear models generally assume that the residuals are normally distributed
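One quick way to check that assumption is a Q-Q plot plus a Shapiro-Wilk test on the residuals; a minimal sketch using the built-in ToothGrowth data (the lendingclub file is local to the original analysis):

```r
#fit an ANOVA model and inspect its residuals for normality
fit <- aov(len ~ supp, data = ToothGrowth)

#Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(residuals(fit))
qqline(residuals(fit))

#Shapiro-Wilk test: a small p-value suggests departure from normality
shapiro.test(residuals(fit))
```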

A/B Testing:

  • A/B tests are a type of controlled experiment with only two variants of something
  • Power and sample size calculations are crucial to A/B testing, since they indicate the sample size required to detect the expected effect size at the desired power
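A minimal sketch of that calculation with pwr.t.test(), assuming the pwr package is installed (the effect size d = 0.1 is an illustrative guess for an A/B test):

```r
#required sample size per arm for a two-sided, two-sample A/B test
library(pwr)
pwr.t.test(d = 0.1, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
```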

Example code includes:

lendingclub <- readr::read_csv("./RInputFiles/lendclub.csv")
## Parsed with column specification:
## cols(
##   member_id = col_double(),
##   loan_amnt = col_double(),
##   funded_amnt = col_double(),
##   term = col_character(),
##   int_rate = col_double(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   annual_inc = col_double(),
##   verification_status = col_character(),
##   loan_status = col_character(),
##   purpose = col_character(),
##   grade = col_character()
## )
#examine the variables with glimpse()
glimpse(lendingclub)
## Observations: 1,500
## Variables: 12
## $ member_id           <dbl> 55096114, 1555332, 1009151, 69524202, 7212...
## $ loan_amnt           <dbl> 11000, 10000, 13000, 5000, 18000, 14000, 8...
## $ funded_amnt         <dbl> 11000, 10000, 13000, 5000, 18000, 14000, 8...
## $ term                <chr> "36 months", "36 months", "60 months", "36...
## $ int_rate            <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13...
## $ emp_length          <chr> "10+ years", "10+ years", "3 years", "10+ ...
## $ home_ownership      <chr> "RENT", "MORTGAGE", "MORTGAGE", "MORTGAGE"...
## $ annual_inc          <dbl> 51000, 40000, 78204, 51000, 96000, 47000, ...
## $ verification_status <chr> "Not Verified", "Verified", "Not Verified"...
## $ loan_status         <chr> "Current", "Fully Paid", "Fully Paid", "Cu...
## $ purpose             <chr> "debt_consolidation", "debt_consolidation"...
## $ grade               <chr> "C", "A", "B", "C", "A", "D", "C", "A", "D...
#find median loan_amt, mean int_rate, and mean annual_inc with summarise()
lendingclub %>% summarise(median(loan_amnt), mean(int_rate), mean(annual_inc))
## # A tibble: 1 x 3
##   `median(loan_amnt)` `mean(int_rate)` `mean(annual_inc)`
##                 <dbl>            <dbl>              <dbl>
## 1               13000             13.3             75736.
# use ggplot2 to build a bar chart of purpose
ggplot(data=lendingclub, aes(x = purpose)) + geom_bar()

#use recode() to create the new purpose_recode variable.
lendingclub$purpose_recode <- lendingclub$purpose %>% recode( 
        "credit_card" = "debt_related",
        "debt_consolidation" = "debt_related", 
        "medical" = "debt_related",
        "car" = "big_purchase", 
        "major_purchase" = "big_purchase", 
        "vacation" = "big_purchase",
        "moving" = "life_change", 
        "small_business" = "life_change", 
        "wedding" = "life_change",
        "house" = "home_related", 
        "home_improvement" = "home_related"
        )


#build a linear regression model, stored as purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)

#look at results of purpose_recode_model
summary(purpose_recode_model)
## 
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14472  -6251  -1322   4678  25761 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9888.1     1248.9   7.917 4.69e-15 ***
## purpose_recodedebt_related       5433.5     1270.5   4.277 2.02e-05 ***
## purpose_recodehome_related       4845.0     1501.0   3.228  0.00127 ** 
## purpose_recodelife_change        4095.3     2197.2   1.864  0.06254 .  
## purpose_recodeother              -649.3     1598.3  -0.406  0.68461    
## purpose_recoderenewable_energy  -1796.4     4943.3  -0.363  0.71636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared:  0.03473,    Adjusted R-squared:  0.0315 
## F-statistic: 10.75 on 5 and 1494 DF,  p-value: 3.598e-10
#get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova(purpose_recode_model)

# look at the class of purpose_recode_anova
class(purpose_recode_anova)
## [1] "anova"      "data.frame"
#Use aov() to build purpose_recode_aov
purpose_recode_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)

#Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD(purpose_recode_aov)

#tidy tukey_output to make sense of the results
broom::tidy(tukey_output)
## # A tibble: 15 x 6
##    term       comparison            estimate conf.low conf.high adj.p.value
##    <chr>      <chr>                    <dbl>    <dbl>     <dbl>       <dbl>
##  1 purpose_r~ debt_related-big_pur~    5434.    1808.     9059.     2.91e-4
##  2 purpose_r~ home_related-big_pur~    4845.     562.     9128.     1.61e-2
##  3 purpose_r~ life_change-big_purc~    4095.   -2174.    10365.     4.25e-1
##  4 purpose_r~ other-big_purchase       -649.   -5210.     3911.     9.99e-1
##  5 purpose_r~ renewable_energy-big~   -1796.  -15902.    12309.     9.99e-1
##  6 purpose_r~ home_related-debt_re~    -589.   -3056.     1879.     9.84e-1
##  7 purpose_r~ life_change-debt_rel~   -1338.   -6539.     3863.     9.78e-1
##  8 purpose_r~ other-debt_related      -6083.   -9005.    -3160.     5.32e-8
##  9 purpose_r~ renewable_energy-deb~   -7230.  -20894.     6434.     6.58e-1
## 10 purpose_r~ life_change-home_rel~    -750.   -6429.     4929.     9.99e-1
## 11 purpose_r~ other-home_related      -5494.   -9201.    -1787.     3.58e-4
## 12 purpose_r~ renewable_energy-hom~   -6641.  -20494.     7212.     7.46e-1
## 13 purpose_r~ other-life_change       -4745.  -10636.     1147.     1.95e-1
## 14 purpose_r~ renewable_energy-lif~   -5892.  -20482.     8698.     8.59e-1
##  15 purpose_r~ renewable_energy-oth~   -1147.  -15088.    12794.     1.00e+0
#Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov(funded_amnt ~ purpose_recode + emp_length, data=lendingclub)

#print purpose_emp_aov to the console
purpose_emp_aov
## Call:
##    aov(formula = funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
## 
## Terms:
##                 purpose_recode   emp_length    Residuals
## Sum of Squares      3688783338   2044273211 100488872355
## Deg. of Freedom              5           11         1483
## 
## Residual standard error: 8231.679
## Estimated effects may be unbalanced
#call summary() to see the p-values
summary(purpose_emp_aov)
##                  Df    Sum Sq   Mean Sq F value   Pr(>F)    
## purpose_recode    5 3.689e+09 737756668  10.888 2.63e-10 ***
## emp_length       11 2.044e+09 185843019   2.743  0.00161 ** 
## Residuals      1483 1.005e+11  67760534                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#examine the summary of int_rate
summary(lendingclub$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.32    9.99   12.99   13.31   16.29   26.77
#examine int_rate by grade
lendingclub %>% 
    group_by(grade) %>% 
    summarise(mean = mean(int_rate), var = var(int_rate), median = median(int_rate))
## # A tibble: 7 x 4
##   grade  mean   var median
##   <chr> <dbl> <dbl>  <dbl>
## 1 A      7.27 0.961   7.26
## 2 B     10.9  2.08   11.0 
## 3 C     14.0  1.42   14.0 
## 4 D     17.4  1.62   17.6 
## 5 E     20.1  2.71   20.0 
## 6 F     23.6  2.87   23.5 
## 7 G     26.1  0.198  25.9
#make a boxplot of int_rate by grade
ggplot(lendingclub, aes(x = grade, y = int_rate)) + geom_boxplot()

#use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate ~ grade, data = lendingclub)
summary(grade_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## grade          6  27013    4502    2637 <2e-16 ***
## Residuals   1493   2549       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#for a 2x2 grid of plots:
par(mfrow=c(2, 2))

#plot grade_aov
plot(grade_aov)

#back to defaults
par(mfrow=c(1, 1))

#Bartlett's test for homogeneity of variance
bartlett.test(int_rate ~ grade, data=lendingclub)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15
#use the correct function from pwr to find the sample size
pwr::pwr.t.test(n=NULL, d=0.2, sig.level=0.05, 
                type="two.sample", alternative="two.sided", power=0.8
                )
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
lc_A <- c(11976148, 1203719, 54998739, 5801830, 31587242, 7711391, 54494666, 57663583, 8967787, 21760921, 44765721, 8596988, 5794746, 59501253, 10578432, 36058744, 11727607, 357888, 51936863, 1178593, 57315811, 5705168, 46024211, 12947039, 57345207, 55299831, 28763037, 49763149, 20077511, 60216198, 12295190, 1570287, 61408414, 59121340, 32349527, 5773180, 26899704, 55412161, 2217935, 16462713, 9196065, 27802028, 40949245, 56007625, 56935379, 62187473, 20178048, 604912, 58533358, 652594, 44066849, 38942161, 6414816, 65617953, 51816492, 43489983, 6794967, 42345315, 59532019, 13107597, 63249029, 7371829, 12335467, 8560739, 7337238, 887484, 23493355, 41031080, 60537197, 12816159, 38446687, 51026618, 6374688, 18685270, 296645, 44439325, 4915968, 63449566, 25256236, 63407874, 36753301, 20728660, 7937228, 13058684, 636359, 50527238, 40450502, 1018943, 12438198, 3065732, 1510626, 5764344, 37840363, 27460227, 39751366, 5028066, 43956700, 56109033, 1412622, 44289534, 41770436, 49956562, 44409121, 47168726, 60953428, 52189251, 64281487, 51928150, 1002880, 4537354, 12605849, 477843, 6808167, 38629237, 33311208, 36109419, 58593881, 40362979, 440300, 9848361, 30656060, 15691500, 4375269, 15360849, 7077904, 66076532, 33350264, 4175651, 44006939, 21130605, 54098234, 53192890, 7371114, 12967808, 58061230, 34803392, 5544911, 28843825, 63244663, 38504887, 68565204, 1211255, 63427670, 56472411, 10548622, 43957279, 59313014, 5768723, 66210490, 25507112, 55472659, 61339767, 65684813, 45544639, 43710238, 46833245, 13028661, 13167268, 3064642, 62072249, 27631726, 65825964, 15540990, 64320858, 8605358, 17795606, 9894584, 543619, 2380700, 20959552, 57743104, 63917130, 38480348, 61393540, 19916851)
lc_A <- c(lc_A, 12528162, 7264617, 61480809, 36411752, 20139228, 21290880, 390228, 45584424, 17755019, 23413261, 15490914, 1254285, 875004, 24274579, 51006600, 11458143, 5125832, 37802077, 57327243, 41059894, 64978360, 58683523, 4290736, 40919379, 65029207, 7096004, 42285591, 7388784, 65914238, 46833088, 21221678, 62855006, 10557733, 44915714, 23083224, 67289213, 9746670, 349608, 66610322, 1595886, 3635144, 38419356, 9715410, 9726377, 621152, 23213635, 18685424, 65782663, 57304429, 20770003, 8865120, 58664359, 1454540, 42404539, 60952405, 61339308, 7367648, 11215938, 41207320, 23553299, 1681376, 7617266, 30485630, 10604792, 46044414, 63094909, 59189668, 10106916, 52058386, 17763104, 6396213, 8981232, 48070364, 10615808, 11956507, 38444903, 60216940, 58310439, 10099562, 7504691, 17533228, 62236540, 38626163, 55657128, 7728107, 42415348, 42454693, 4777573, 23834164, 25157042, 1339435, 50587486, 55998961, 32950014, 28422748, 492346, 50607472, 11335041, 4254623, 65058537, 5375256, 5646680, 44430975, 4054992, 55253292, 68375791, 16822421, 64978226, 59859214, 65424555, 10112206, 6908772, 67879649, 4794842, 31227479, 17423361, 64049774, 58624386, 14829134, 50233873, 44389635, 29684724, 452267, 43044890, 55942742, 19516366, 34443897, 57135665, 34392172, 17352839, 12896521, 40451807, 43255228, 40372428, 8568706, 68364520, 3486848, 40991148, 19196658, 8658538, 65885614, 38352455, 65674149, 1029473, 39290483, 47420355, 65364529, 32318884, 13115811, 48484348, 65975356, 56129109, 3378980, 31026386, 55231010, 41113253, 1480114, 51406116, 2445051, 8627441, 60942818, 55453270, 58573102, 25767158, 9655554, 49783137, 42273770, 32038806, 681948, 65059359, 48546050, 20169281, 68546780, 7065575, 46387142, 66180493, 58430918, 1390497, 41950574, 39888056, 11774847, 55308824, 51969105, 7936525, 5960208, 7700566, 14529825, 14688918, 43024566, 21110140, 55797803, 31236439, 6817136, 1467168, 36028128, 60781310, 66595886, 57548184, 3194733, 8589175, 1546517, 17654773, 40572454, 63284984, 
5780985, 39660177, 64050493, 55081623, 51346675, 1235123, 65633931, 66390924, 17413278, 57950994, 55911330, 11814853, 31357211, 56038385, 40038565, 64400706, 35034758, 60296238, 6527713, 5685238, 1062701, 63406447, 64008930, 63476297, 5114652, 20060374, 10085133, 61328568, 9435001, 56057656, 49934674, 39661404, 19616499, 34342717, 46653815, 45614269, 59290211, 31296803, 50605437, 46928301, 58562582, 63879452, 65733359, 51086476, 40601201, 9845217, 29213549, 41227222, 7337659, 46517072, 38610653, 9694813, 21350102, 46716202, 50535150, 39729407, 22263578, 25987787, 64913590, 19636684, 59311687, 4295372, 571012, 20588847, 63424767, 1099384, 3810242, 5604591, 39760687, 43739869, 56019939, 51526987, 45494853, 4302122, 21009984, 66210827, 67255219, 46613149, 63345017, 43570211, 62002161, 2214708, 4234697, 51055338, 19647002, 28593783, 6804647, 40542044, 42263319, 4784593, 19636686, 44015285, 55697847, 5814660, 15409525, 2307393, 54404433, 15490230, 62245810, 64969544, 48120716, 41040511, 51176224, 6376426, 60386775, 826517, 27601385, 8185587, 28564285, 68613325, 58623041, 60941473, 1635691, 7729270, 46417835, 57285778, 55960993, 66510262, 60285691, 61902329, 68565071)


lc_B <- c(62012715, 49974687, 27570947, 63417796, 61449107, 12906517, 57074291, 21021086, 404854, 15139172, 46774978, 50486061, 4305577, 65783354, 48544529, 31667129, 36980133, 19117791, 3845908, 846821, 40381968, 64018601, 57184860, 49963980, 44142706, 6327771, 20811335, 67336862, 3628833, 31247310, 4764984, 1619549, 56492219, 67959628, 61672211, 1472227, 55268407, 13497237, 57538143, 43096178, 35723158, 226780, 2307012, 1210773, 50273799, 28903599, 50839792, 44916418, 9714937, 51876659, 3919804, 12968154, 54978278, 6938022, 53854432, 63350177, 39692948, 67216234, 22253060, 59099446, 46135199, 11717805, 48596572, 8475061, 61462130, 21480483, 2014943, 41430440, 43196143, 243173, 61543762, 66562164, 67878273, 41100627, 11915326, 28753020, 12617369, 59090559, 55583726, 31256585, 544537, 61430245, 1681767, 7670078, 38506546, 36500594, 31367711, 46694948, 2080069, 38457330, 54524836, 27651989, 63358477, 62002922, 8995111, 45694307, 61470409, 17933815, 27370082, 66612753, 1536521, 54948920, 57548472, 876991, 40127147, 57365210, 1904740, 3195692, 743529, 67408356, 8766184, 23643466, 51336378, 13397002, 3700020, 49935259, 38455198, 63506356, 11386690, 32479126, 6300017, 67427011, 63344398, 51366616, 727247, 59291548, 21551336, 8776003, 16111335, 1051513, 61973285, 60764833, 59190150, 25406927, 10138072, 61361677, 32279884, 63337618, 49933340, 30565592, 3217416, 61883095, 63436296, 58290318, 29884855, 50353289, 14699170, 67625637, 6815821, 2286867, 6274586, 17853756, 55948157, 6995898, 44126015, 66643915, 41338910, 8626219, 67858810, 38597465, 45884338, 565018, 46436141, 15259622, 6594706, 39479497, 5535388, 5855546, 48734782, 2896555, 67296211, 713979, 33110251, 8987918, 1224687, 5637315, 484473, 9814600, 29694710, 60902260, 25897153, 40705483, 1439301, 3055155, 26319992, 6245002, 66441896, 46427698, 36330836, 8915199, 46205024, 62459417, 3497439, 54888931, 30475522, 38998249, 12636103, 60536957)
lc_B <- c(lc_B, 27521279, 2365984, 361549, 43430210, 35843833, 9768308, 12705933, 59179388, 60830121, 67929084, 36138408, 854552, 8865548, 13096420, 23836169, 61502149, 1621627, 11426617, 48274995, 41123011, 7296181, 29635336, 30565882, 8145149, 46116481, 21119590, 43894290, 65866235, 44143687, 873468, 12419378, 26378681, 55140334, 56964922, 61682200, 14338072, 65047247, 57267246, 59581503, 41093708, 48524124, 513842, 1685090, 42723216, 60647576, 55341080, 9735578, 41110083, 30255415, 56010965, 63214550, 67828966, 671468, 38540004, 65107371, 18645038, 26017706, 660734, 573283, 9454644, 64017354, 617449, 7645594, 43286428, 55941273, 8636865, 31226902, 46194753, 6160505, 1412225, 65741544, 24084859, 58532795, 41880754, 45515321, 60585561, 65272380, 7937327, 1489732, 17553239, 7638498, 1473206, 38162164, 3355990, 15610681, 57025137, 6254978, 38162571, 52768311, 5938741, 58101279, 18895673, 30175739, 38222417, 55909312, 65663878, 6607837, 24725076, 61722475, 11895058, 28182084, 185962, 55259655, 16241080, 66602227, 5781939, 60801476, 6996130, 12346893, 65672013, 19076244, 1475379, 9056893, 59492895, 56864322, 60942704, 44015940, 62225220, 39739191, 66435524, 44199929, 59471139, 38547168, 6205030, 38615829, 6698930, 66514563, 1623685, 60545969, 46703319, 39739315, 12636426, 65364691, 16403147, 9204637, 19306532, 66270322, 65653692, 22313524, 59082682, 19796545, 10766253, 50436003, 49363132, 27600713, 44865530, 57763719, 47857115, 48535477, 65986020, 58603818, 42934257, 1167844, 66390187, 58281312, 63888770, 48596526, 67385135, 24775459, 55090096, 12347068, 37317537, 64007908, 1683908, 11976597, 41019342, 6855113, 7964638, 65701227, 44037648, 23133074, 9787718, 61389384, 38418035, 33130454, 13038119, 14639242, 38505864, 65725266, 62904623, 68513661, 36039498, 6538734, 51857455, 59139740, 64341225, 21430833, 55455899, 17795459, 65128493, 46428798, 43216120, 59199242, 50364311, 41079485, 27711293, 63218354, 65492649, 50819365, 40737432, 377507, 65736437, 61488876, 
44886450, 31467727, 46651816, 11914779, 65352381, 24726593, 52989922, 43105128, 34322310, 8669148, 12795739, 38485516, 39559934, 4280915, 63437401, 7103037, 44946049, 15400322, 28583975, 59592185, 877645, 56019484, 3372858, 60556772, 19846532, 11658194, 6894823, 61414862, 52708301, 48806212, 12204849, 60863986, 3919883, 37661631, 47210580, 14689912, 23393084, 60961679, 6170889, 55191727, 14690280, 42415518, 65855022, 62156039, 38536464, 44603544, 63527328, 48182146, 25867085, 61952845, 4744682, 20110370, 65854766, 57722242, 11438361, 34111919, 53262232, 12247443, 64210396, 37630339, 41237564, 46722148, 65791211, 16882760, 7719304, 37622016, 3220774, 51906280, 12446784, 50064210, 57733299, 63437152, 38445791, 3730324, 56052115, 57354312, 58010576, 626701, 7224706, 64079786, 62167132, 8396526, 7625377, 12707224, 35084508, 56022111, 52027979, 43215589, 50425264, 59253209, 28312549, 67376619, 30795837, 43869662, 20849433, 55351366, 39549686, 22972745, 1025579)


# The specific member IDs in lc_A and lc_B are not in dataset lendingclub
lendingclub_ab <- lendingclub %>%
    mutate(Group=ifelse(member_id %in% lc_A, "A", ifelse(member_id %in% lc_B, "B", "C")))


# ggplot(lendingclub_ab, aes(x=Group, y=loan_amnt)) + geom_boxplot()

#conduct a two-sided t-test
# t.test(loan_amnt ~ Group, data=lendingclub_ab)


#build lendingclub_multi
# lendingclub_multi <-lm(loan_amnt ~ Group + grade + verification_status, data=lendingclub_ab)

#examine lendingclub_multi results
# broom::tidy(lendingclub_multi)

Chapter 3 - Randomized Complete (and Balanced Incomplete) Block Designs

Intro to NHANES Dataset and Sampling:

  • NHANES is the National Health and Nutrition Examination Survey, run in two-year cycles in the US since the late 1990s (it ran on varying schedules starting in the 1960s)
  • NHANES individuals are sampled under a scheme designed to match US demographics, with oversampling of the elderly and of minorities to ensure sample sizes sufficient for statistical conclusions
  • Two key types of sampling
    • Probability sampling - probability is used to select the sample (will be covered in this course)
    • Non-probability sampling - voluntary (whoever responds), convenience (whoever the researcher can find)
  • Many types of random sampling can be run in R
    • Simple Random Sampling - sample()
    • Stratified Sampling - dataset %>% group_by(strata_variable) %>% sample_n() # sample a specified number of people inside each segment
    • Cluster Sampling - sampling::cluster(dataset, cluster_var_name, number_to_select, method = "option") # select everyone in each randomly selected cluster
    • Systematic Sampling - every 5th or 10th or etc. person (implemented by custom functions)
    • Multi-Stage Sampling - combinations of 2+ of the above approaches in a sensible and structured manner
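
Systematic sampling has no dedicated tidyverse verb; a minimal custom-function sketch (the name systematic_sample is illustrative, not from the course) might look like:

```r
#systematic sampling: keep every k-th row after a random starting offset
systematic_sample <- function(df, k) {
  start <- sample(seq_len(k), 1)  # random start within the first k rows
  df[seq(from = start, to = nrow(df), by = k), , drop = FALSE]
}

#e.g., every 10th row of mtcars
sampled <- systematic_sample(mtcars, 10)
```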

Randomized Complete Block Designs (RCBD):

  • RCBD is run when there is a potential nuisance factor in the data that might otherwise impact the results and conclusions
    • Randomized - treatment is assigned randomly inside each block
    • Complete - each treatment is used the same number of times inside each block
    • Block - experimental groups are blocked to be similar (differences within a block are expected to be smaller than differences across blocks)
    • Design - the experiment
    • “Block what you can, randomize what you cannot”
  • The library(agricolae) allows for drawing some of the experimental designs such as an RCBD
    • library(agricolae)
    • trt <- letters[1:4]
    • rep <- 4
    • my_design_rcbd <- design.rcbd(trt, r = rep, seed = 42, serie = 0) # serie controls how the plot numbers are labeled
    • my_design_rcbd$sketch

Balanced Incomplete Block Designs (BIBD):

  • Incomplete blocks arise when you cannot fit every treatment inside each block
    • Balanced - each pair of treatments occurs together in a block an equal number of times
    • Incomplete - not every treatment will appear in every block
    • Block - experimental groups are blocked to be similar (differences within a block are expected to be smaller than differences across blocks)
    • Design - the experiment
  • Suppose that t is the number of treatments, k is the number of treatments per block, and r is the number of replications
    • lambda = r * (k - 1) / (t - 1)
    • If lambda is a whole number, then a BIBD is possible; otherwise, it is not

Example code includes:

nhanes_demo <- readr::read_csv("./RInputFiles/nhanes_demo.csv")
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
nhanes_medical <- readr::read_csv("./RInputFiles/nhanes_medicalconditions.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   mcq230d = col_logical(),
##   mcq240b = col_logical(),
##   mcq240c = col_logical(),
##   mcq240d = col_logical(),
##   mcq240dk = col_logical(),
##   mcq240h = col_logical(),
##   mcq240i = col_logical(),
##   mcq240k = col_logical(),
##   mcq240q = col_logical(),
##   mcq240r = col_logical(),
##   mcq240s = col_logical(),
##   mcq240v = col_logical(),
##   mcq240y = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 34 parsing failures.
##  row     col           expected actual                                         file
## 1510 mcq240s 1/0/T/F/TRUE/FALSE     41 './RInputFiles/nhanes_medicalconditions.csv'
## 1900 mcq240s 1/0/T/F/TRUE/FALSE     53 './RInputFiles/nhanes_medicalconditions.csv'
## 1982 mcq240q 1/0/T/F/TRUE/FALSE     80 './RInputFiles/nhanes_medicalconditions.csv'
## 3132 mcq240q 1/0/T/F/TRUE/FALSE     56 './RInputFiles/nhanes_medicalconditions.csv'
## 3452 mcq240c 1/0/T/F/TRUE/FALSE     69 './RInputFiles/nhanes_medicalconditions.csv'
## .... ....... .................. ...... ............................................
## See problems(...) for more details.
nhanes_bodymeasures <- readr::read_csv("./RInputFiles/nhanes_bodymeasures.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   bmihead = col_logical()
## )
## See spec(...) for full column specifications.
dummy_nhanes_final <- readr::read_csv("./RInputFiles/nhanes_final.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   mcq230d = col_logical(),
##   mcq240b = col_logical(),
##   mcq240c = col_logical(),
##   mcq240d = col_logical(),
##   mcq240dk = col_logical(),
##   mcq240h = col_logical(),
##   mcq240i = col_logical(),
##   mcq240k = col_logical(),
##   mcq240q = col_logical(),
##   mcq240r = col_logical(),
##   mcq240s = col_logical(),
##   mcq240v = col_logical(),
##   mcq240y = col_logical(),
##   bmxhead = col_logical(),
##   bmihead = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 33 parsing failures.
##  row     col           expected actual                             file
## 1443 mcq240s 1/0/T/F/TRUE/FALSE     41 './RInputFiles/nhanes_final.csv'
## 1819 mcq240s 1/0/T/F/TRUE/FALSE     53 './RInputFiles/nhanes_final.csv'
## 1897 mcq240q 1/0/T/F/TRUE/FALSE     80 './RInputFiles/nhanes_final.csv'
## 3000 mcq240q 1/0/T/F/TRUE/FALSE     56 './RInputFiles/nhanes_final.csv'
## 3892 mcq240s 1/0/T/F/TRUE/FALSE     25 './RInputFiles/nhanes_final.csv'
## .... ....... .................. ...... ................................
## See problems(...) for more details.
#merge the 3 datasets you just created to create nhanes_combined
nhanes_combined <- list(nhanes_demo, nhanes_medical, nhanes_bodymeasures) %>%
  Reduce(function(df1, df2) inner_join(df1, df2, by="seqn"), .)


#fill in the dplyr code
nhanes_combined %>% group_by(mcq365d) %>% summarise(mean = mean(bmxwt, na.rm = TRUE))
## # A tibble: 4 x 2
##   mcq365d  mean
##     <dbl> <dbl>
## 1       1  90.7
## 2       2  76.5
## 3       9  90.8
## 4      NA  33.5
#fill in the ggplot2 code
nhanes_combined %>% filter(ridageyr > 16) %>% 
  ggplot(aes(x=as.factor(mcq365d), y=bmxwt)) +
  geom_boxplot()
## Warning: Removed 70 rows containing non-finite values (stat_boxplot).

#filter out anyone less than 16
nhanes_filter <- nhanes_combined %>% filter(ridageyr > 16)

#use simputation & impute bmxwt to fill in missing values
nhanes_final <- simputation::impute_median(nhanes_filter, bmxwt ~ riagendr)

#recode mcq365d with ifelse() & examine with table()
nhanes_final$mcq365d <- ifelse(nhanes_final$mcq365d==9, 2, nhanes_final$mcq365d)
table(nhanes_final$mcq365d)
## 
##    1    2 
## 1802 4085
#use sample() to create nhanes_srs
nhanes_srs <- nhanes_final[sample(nrow(nhanes_final), 2500), ]

#create nhanes_stratified with group_by() and sample_n()
nhanes_stratified <- nhanes_final %>%
  group_by(riagendr) %>%
  sample_n(2000)
table(nhanes_stratified$riagendr)
## 
##    1    2 
## 2000 2000
#load sampling package and create nhanes_cluster with cluster()
nhanes_cluster <- sampling::cluster(nhanes_final, "indhhin2", 6, method = "srswor")


#use str() to view design.rcbd's criteria
str(agricolae::design.rcbd)
## function (trt, r, serie = 2, seed = 0, kinds = "Super-Duper", first = TRUE, 
##     continue = FALSE, randomization = TRUE)
#build trt and rep
trt <- LETTERS[1:5]
rep <- 4

#Use trt and rep to build my.design.rcbd and view the sketch part of the object
my_design_rcbd <- agricolae::design.rcbd(trt, r=rep, seed = 42, serie=0)
my_design_rcbd$sketch
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "D"  "E"  "A"  "C"  "B" 
## [2,] "B"  "C"  "A"  "E"  "D" 
## [3,] "C"  "D"  "A"  "E"  "B" 
## [4,] "A"  "C"  "B"  "D"  "E"
#make nhanes_final$riagendr a factor variable
nhanes_final$riagendr <- factor(nhanes_final$riagendr)

#use aov() to create nhanes_rcbd
nhanes_rcbd <- aov(bmxwt ~ mcq365d + riagendr, data=nhanes_final)

#check the results of nhanes_rcbd with summary()
summary(nhanes_rcbd)
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## mcq365d        1  229164  229164   571.2 <2e-16 ***
## riagendr       1  163069  163069   406.4 <2e-16 ***
## Residuals   5884 2360774     401                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#print the difference in weights by mcq365d and riagendr
nhanes_final %>% group_by(mcq365d, riagendr) %>% summarise(mean_wt = mean(bmxwt))
## # A tibble: 4 x 3
## # Groups:   mcq365d [2]
##   mcq365d riagendr mean_wt
##     <dbl> <fct>      <dbl>
## 1       1 1           95.2
## 2       1 2           86.6
## 3       2 1           82.7
## 4       2 2           71.3
#set up the 2x2 plotting grid and then plot nhanes_rcbd
par(mfrow=c(2, 2))
plot(nhanes_rcbd)

par(mfrow=c(1, 1))

#run the code to view the interaction plots
with(nhanes_final, interaction.plot(mcq365d, riagendr, bmxwt))

#run the code to view the interaction plots
with(nhanes_final, interaction.plot(riagendr, mcq365d, bmxwt))

#create my_design_bibd_1
# my_design_bibd_1 <- design.bib(LETTERS[1:3], k = 4, r = 16, serie = 0, seed = 42)  # will throw an error

#create my_design_bibd_2
# my_design_bibd_2 <- design.bib(letters[1:2], k = 3, r = 5, serie = 0, seed = 42)  # will throw warning

#create my_design_bibd_3
my_design_bibd_3 <- agricolae::design.bib(letters[1:4], k = 4, r = 6, serie = 0, seed = 42)
## 
## Parameters BIB
## ==============
## Lambda     : 6
## treatmeans : 4
## Block size : 4
## Blocks     : 6
## Replication: 6 
## 
## Efficiency factor 1 
## 
## <<< Book >>>
my_design_bibd_3$sketch
##      [,1] [,2] [,3] [,4]
## [1,] "d"  "b"  "a"  "c" 
## [2,] "d"  "c"  "b"  "a" 
## [3,] "c"  "d"  "b"  "a" 
## [4,] "a"  "b"  "d"  "c" 
## [5,] "b"  "d"  "a"  "c" 
## [6,] "a"  "b"  "d"  "c"
lambda <- function(t, k, r){
  return((r*(k-1)) / (t-1))
}

#calculate lambda
lambda(4, 3, 3)
## [1] 2
#build the data.frame
creatinine <- c(1.98, 1.97, 2.35, 2.09, 1.87, 1.95, 2.08, 2.01, 1.84, 2.06, 1.97, 2.22)
food <- as.factor(c("A", "C", "D", "A", "B", "C", "B", "C", "D", "A", "B", "D"))
color <- as.factor(rep(c("Black", "White", "Orange", "Spotted"), each = 3))
cat_experiment <- as.data.frame(cbind(creatinine, food, color))  # note: cbind() coerces the factors to their numeric codes

#create cat_model and examine it with summary()
cat_model <- aov(creatinine ~ food + color, data=cat_experiment)
summary(cat_model)
##             Df  Sum Sq  Mean Sq F value Pr(>F)
## food         1 0.01204 0.012042   0.530  0.485
## color        1 0.00697 0.006971   0.307  0.593
## Residuals    9 0.20461 0.022735
#calculate lambda
lambda(3, 3, 2)
## [1] 2
#create weightlift_model & examine results (variable does not exist in dataset)
# weightlift_model <- aov(bmxarmc ~ weightlift_treat + ridreth1, data=nhanes_final)
# summary(weightlift_model)

Chapter 4 - Latin Squares, Graeco-Latin Squares, Factorial Experiments

Latin Squares have two blocking factors, assumed not to interact with each other or the treatment, and each with the same number of levels:

  • Latin squares can be analyzed just like an RCBD
  • In a Latin square, both the rows and the columns are the blocking factors
  • Can use the nyc_scores dataset, containing SAT reading, writing, and math scores from accredited New York City high schools
    • Goal is to assess the impact of a (fabricated) tutoring program on the scores by school
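
As with RCBDs, agricolae can sketch a Latin square design, here via design.lsd(); a minimal example (the treatment labels A-D are illustrative, not from the course):

```r
#sketch a 4x4 Latin square design with agricolae
trt <- LETTERS[1:4]
my_design_lsd <- agricolae::design.lsd(trt, seed = 42, serie = 0)
my_design_lsd$sketch  # each treatment appears once per row and once per column
```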

Graeco-Latin Squares builds on Latin squares by adding an additional blocking factor:

  • Three blocking factors, all with the same number of levels, and assumed not to interact with each other or the treatment
    • Greek letters placed next to the Latin letters indicate the third blocking factor (Latin letters plus numbers can be used instead)
    • Each combination occurs exactly once (each Latin letter once per row and column, and each Greek letter/number once with each Latin letter)
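
agricolae also provides design.graeco(), which takes two treatment vectors of equal length; a minimal sketch (the labels below are illustrative, not from the course):

```r
#sketch a 4x4 Graeco-Latin square: one Latin and one "Greek" treatment vector
trt1 <- LETTERS[1:4]
trt2 <- c("alpha", "beta", "gamma", "delta")
my_design_graeco <- agricolae::design.graeco(trt1, trt2, seed = 42, serie = 0)
my_design_graeco$sketch
```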

Factorial Experiments - designs in which 2+ factors are crossed in an experiment, with every combination of factor levels tested as a treatment:

  • Example of testing all combinations of high/low water and high/low light - each combination is tested, with TukeyHSD() applied after
  • This course will focus on 2^k factorial experiments, meaning each of the k factors has only two levels (e.g., high/low)
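
A minimal 2^2 factorial sketch on toy data (the water/light setup and growth values below are fabricated for illustration, not from the course):

```r
#toy 2x2 factorial: high/low water crossed with high/low light, 5 reps per cell
plant_data <- expand.grid(water = c("low", "high"),
                          light = c("low", "high"),
                          rep   = 1:5)
set.seed(42)
plant_data$growth <- rnorm(nrow(plant_data), mean = 10, sd = 2)

#model both main effects and their interaction, then follow up with Tukey's HSD
factorial_aov <- aov(growth ~ water * light, data = plant_data)
summary(factorial_aov)
TukeyHSD(factorial_aov)
```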

Next steps:

  • Many other types of factorial designs exist - they do not all need to be 2^k and may have many factor levels
    • Might consider a fractional factorial design to minimize the analytical burden
  • Design should be a valued and integrated part of the process
  • There will always be some unmeasured confounders, but good design can help to reduce that noise
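
A fractional factorial design can be generated with the FrF2 package (not used in the course; shown here only as a pointer); e.g., 4 two-level factors in only 8 runs:

```r
#a 2^(4-1) fractional factorial: 4 two-level factors in 8 runs instead of 16
frac_design <- FrF2::FrF2(nruns = 8, nfactors = 4, randomize = FALSE)
frac_design
```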

Example code includes:

nyc_scores <- readr::read_csv("./RInputFiles/nyc_scores.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   School_ID = col_character(),
##   School_Name = col_character(),
##   Borough = col_character(),
##   Building_Code = col_character(),
##   Street_Address = col_character(),
##   City = col_character(),
##   State = col_character(),
##   Phone_Number = col_character(),
##   Start_Time = col_time(format = ""),
##   End_Time = col_time(format = "")
## )
## See spec(...) for full column specifications.
glimpse(nyc_scores)
## Observations: 435
## Variables: 22
## $ School_ID                 <chr> "02M260", "06M211", "01M539", "02M29...
## $ School_Name               <chr> "Clinton School Writers and Artists"...
## $ Borough                   <chr> "Manhattan", "Manhattan", "Manhattan...
## $ Building_Code             <chr> "M933", "M052", "M022", "M445", "M44...
## $ Street_Address            <chr> "425 West 33rd Street", "650 Academy...
## $ City                      <chr> "Manhattan", "Manhattan", "Manhattan...
## $ State                     <chr> "NY", "NY", "NY", "NY", "NY", "NY", ...
## $ Zip_Code                  <dbl> 10001, 10002, 10002, 10002, 10002, 1...
## $ Latitude                  <dbl> 40.75321, 40.86605, 40.71873, 40.716...
## $ Longitude                 <dbl> -73.99786, -73.92486, -73.97943, -73...
## $ Phone_Number              <chr> "212-695-9114", "718-935-3660", "212...
## $ Start_Time                <drtn>       NA, 08:30:00, 08:15:00, 08:00...
## $ End_Time                  <drtn>       NA, 15:00:00, 16:00:00, 14:45...
## $ Student_Enrollment        <dbl> NA, 87, 1735, 358, 383, 416, 255, 54...
## $ Percent_White             <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0....
## $ Percent_Black             <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0....
## $ Percent_Hispanic          <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0....
## $ Percent_Asian             <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0....
## $ Average_Score_SAT_Math    <dbl> NA, NA, 657, 395, 418, 613, 410, 634...
## $ Average_Score_SAT_Reading <dbl> NA, NA, 601, 411, 428, 453, 406, 641...
## $ Average_Score_SAT_Writing <dbl> NA, NA, 601, 387, 415, 463, 381, 639...
## $ Percent_Tested            <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60...
tEL <- c('PhD', 'BA', 'BA', 'MA', 'MA', 'PhD', 'MA', 'MA', 'BA', 'PhD', 'College Student', 'College Student', 'Grad Student', 'MA', 'MA', 'MA', 'BA', 'MA', 'BA', 'MA', 'College Student', 'PhD', 'MA', 'MA', 'BA', 'MA', 'College Student', 'BA', 'PhD', 'Grad Student', 'MA', 'Grad Student', 'MA', 'College Student', 'Grad Student', 'MA', 'Grad Student', 'BA', 'BA', 'College Student', 'Grad Student', 'College Student', 'BA', 'BA', 'PhD', 'BA', 'Grad Student', 'Grad Student', 'College Student', 'College Student', 'BA', 'PhD', 'College Student', 'PhD', 'PhD', 'PhD', 'College Student', 'Grad Student', 'MA', 'MA', 'BA', 'PhD', 'College Student', 'MA', 'MA', 'College Student', 'Grad Student', 'MA', 'PhD', 'MA', 'College Student', 'MA', 'PhD', 'MA', 'College Student', 'College Student', 'Grad Student', 'PhD', 'MA', 'MA', 'Grad Student', 'MA', 'MA', 'Grad Student', 'PhD', 'Grad Student', 'Grad Student', 'Grad Student', 'MA', 'PhD', 'BA', 'MA', 'Grad Student', 'BA', 'College Student', 'MA', 'College Student', 'Grad Student', 'Grad Student', 'College Student', 'MA', 'BA', 'BA', 'MA', 'MA', 'Grad Student', 'MA', 'Grad Student', 'MA', 'Grad Student', 'College Student', 'College Student', 'College Student', 'MA', 'BA', 'Grad Student', 'Grad Student', 'MA', 'College Student', 'BA', 'Grad Student', 'MA', 'Grad Student', 'PhD', 'MA', 'MA', 'College Student', 'MA', 'College Student', 'PhD', 'College Student', 'MA', 'MA', 'MA', 'MA', 'College Student', 'MA', 'BA', 'MA', 'Grad Student', 'BA', 'MA', 'MA', 'Grad Student', 'MA', 'MA', 'College Student', 'MA', 'MA', 'BA', 'MA', 'College Student', 'Grad Student', 'College Student', 'MA', 'BA', 'MA', 'BA', 'College Student', 'Grad Student', 'Grad Student', 'Grad Student', 'Grad Student', 'Grad Student', 'MA', 'BA', 'MA', 'BA', 'College Student', 'MA', 'BA', 'MA', 'Grad Student', 'MA', 'PhD', 'MA', 'BA', 'Grad Student', 'MA', 'BA', 'BA', 'MA', 'BA', 'College Student', 'BA', 'MA', 'MA', 'BA', 'MA', 'College Student', 'BA', 'Grad Student', 'MA', 
'BA', 'MA', 'MA', 'MA', 'BA', 'College Student', 'College Student')
tEL <- c(tEL, 'BA', 'Grad Student', 'BA', 'BA', 'MA', 'Grad Student', 'BA', 'MA', 'BA', 'PhD', 'MA', 'MA', 'MA', 'BA', 'College Student', 'PhD', 'BA', 'Grad Student', 'BA', 'College Student', 'BA', 'MA', 'College Student', 'MA', 'College Student', 'Grad Student', 'College Student', 'MA', 'PhD', 'BA', 'PhD', 'Grad Student', 'BA', 'BA', 'MA', 'MA', 'BA', 'PhD', 'College Student', 'MA', 'BA', 'College Student', 'BA', 'MA', 'College Student', 'MA', 'College Student', 'BA', 'MA', 'BA', 'BA', 'MA', 'PhD', 'BA', 'MA', 'Grad Student', 'College Student', 'MA', 'College Student', 'MA', 'MA', 'PhD', 'College Student', 'College Student', 'Grad Student', 'Grad Student', 'MA', 'College Student', 'Grad Student', 'Grad Student', 'Grad Student', 'MA', 'Grad Student', 'MA', 'BA', 'College Student', 'MA', 'Grad Student', 'College Student', 'MA', 'BA', 'BA', 'College Student', 'College Student', 'College Student', 'College Student', 'College Student', 'PhD', 'MA', 'College Student', 'MA', 'MA', 'MA', 'PhD', 'College Student', 'College Student', 'MA', 'MA', 'MA', 'PhD', 'MA', 'MA', 'PhD', 'MA', 'Grad Student', 'MA', 'Grad Student', 'MA', 'Grad Student', 'MA', 'MA', 'PhD', 'BA', 'BA', 'Grad Student', 'Grad Student', 'PhD', 'BA', 'BA', 'Grad Student', 'College Student', 'BA', 'College Student', 'MA', 'MA', 'MA', 'Grad Student', 'BA', 'BA', 'MA', 'Grad Student', 'PhD', 'BA', 'Grad Student', 'Grad Student', 'Grad Student', 'BA', 'MA', 'BA', 'College Student', 'College Student', 'Grad Student', 'MA', 'Grad Student', 'Grad Student', 'BA', 'BA', 'MA', 'College Student', 'BA', 'Grad Student', 'Grad Student', 'College Student', 'Grad Student', 'College Student', 'PhD', 'BA', 'MA', 'MA', 'BA', 'College Student', 'College Student', 'PhD', 'MA', 'BA', 'MA', 'MA', 'Grad Student', 'MA', 'PhD', 'MA', 'MA', 'Grad Student', 'College Student', 'MA', 'BA', 'BA', 'College Student', 'Grad Student', 'BA', 'MA', 'MA', 'Grad Student', 'BA', 'Grad Student', 'Grad Student', 'MA', 'PhD', 'Grad Student', 'Grad Student', 'MA', 'MA', 'PhD', 'College Student', 'College Student', 'MA', 'BA', 'MA', 'College Student', 'MA', 'PhD', 'BA', 'MA', 'College Student', 'PhD', 'PhD', 'College Student', 'MA', 'MA', 'MA', 'PhD', 'MA', 'BA', 'College Student', 'BA', 'BA', 'MA', 'MA', 'College Student', 'College Student', 'Grad Student', 'College Student', 'MA', 'MA', 'MA', 'Grad Student', 'MA', 'College Student', 'Grad Student', 'BA', 'Grad Student', 'BA', 'MA', 'College Student', 'MA')


nyc_scores <- nyc_scores %>%
    mutate(Teacher_Education_Level=tEL)
glimpse(nyc_scores)
## Observations: 435
## Variables: 23
## $ School_ID                 <chr> "02M260", "06M211", "01M539", "02M29...
## $ School_Name               <chr> "Clinton School Writers and Artists"...
## $ Borough                   <chr> "Manhattan", "Manhattan", "Manhattan...
## $ Building_Code             <chr> "M933", "M052", "M022", "M445", "M44...
## $ Street_Address            <chr> "425 West 33rd Street", "650 Academy...
## $ City                      <chr> "Manhattan", "Manhattan", "Manhattan...
## $ State                     <chr> "NY", "NY", "NY", "NY", "NY", "NY", ...
## $ Zip_Code                  <dbl> 10001, 10002, 10002, 10002, 10002, 1...
## $ Latitude                  <dbl> 40.75321, 40.86605, 40.71873, 40.716...
## $ Longitude                 <dbl> -73.99786, -73.92486, -73.97943, -73...
## $ Phone_Number              <chr> "212-695-9114", "718-935-3660", "212...
## $ Start_Time                <drtn>       NA, 08:30:00, 08:15:00, 08:00...
## $ End_Time                  <drtn>       NA, 15:00:00, 16:00:00, 14:45...
## $ Student_Enrollment        <dbl> NA, 87, 1735, 358, 383, 416, 255, 54...
## $ Percent_White             <dbl> NA, 0.03, 0.29, 0.12, 0.03, 0.02, 0....
## $ Percent_Black             <dbl> NA, 0.22, 0.13, 0.39, 0.28, 0.03, 0....
## $ Percent_Hispanic          <dbl> NA, 0.68, 0.18, 0.41, 0.57, 0.06, 0....
## $ Percent_Asian             <dbl> NA, 0.05, 0.39, 0.06, 0.09, 0.89, 0....
## $ Average_Score_SAT_Math    <dbl> NA, NA, 657, 395, 418, 613, 410, 634...
## $ Average_Score_SAT_Reading <dbl> NA, NA, 601, 411, 428, 453, 406, 641...
## $ Average_Score_SAT_Writing <dbl> NA, NA, 601, 387, 415, 463, 381, 639...
## $ Percent_Tested            <dbl> NA, NA, 0.91, 0.79, 0.65, 0.96, 0.60...
## $ Teacher_Education_Level   <chr> "PhD", "BA", "BA", "MA", "MA", "PhD"...
#mean, var, and median of Math score by Borough
nyc_scores %>%
    group_by(Borough) %>% 
    summarise(mean = mean(Average_Score_SAT_Math, na.rm=TRUE),
              var = var(Average_Score_SAT_Math, na.rm=TRUE),
              median = median(Average_Score_SAT_Math, na.rm=TRUE))
## # A tibble: 5 x 4
##   Borough        mean   var median
##   <chr>         <dbl> <dbl>  <dbl>
## 1 Bronx          404. 2727.   396.
## 2 Brooklyn       416. 3658.   395 
## 3 Manhattan      456. 7026.   433 
## 4 Queens         462. 5168.   448 
## 5 Staten Island  486. 6911.   466.
#mean, var, and median of Math score by Teacher_Education_Level
nyc_scores %>%
    group_by(Teacher_Education_Level) %>% 
    summarise(mean = mean(Average_Score_SAT_Math, na.rm=TRUE),
              var = var(Average_Score_SAT_Math, na.rm=TRUE),
              median = median(Average_Score_SAT_Math, na.rm=TRUE))
## # A tibble: 5 x 4
##   Teacher_Education_Level  mean   var median
##   <chr>                   <dbl> <dbl>  <dbl>
## 1 BA                       438. 5536.   418 
## 2 College Student          424. 4807.   400.
## 3 Grad Student             437. 7071.   410 
## 4 MA                       432. 4304.   415 
## 5 PhD                      435. 4869.   420.
#mean, var, and median of Math score by Borough and Teacher_Education_Level
nyc_scores %>%
    group_by(Borough, Teacher_Education_Level) %>% 
    summarise(mean = mean(Average_Score_SAT_Math, na.rm=TRUE),
              var = var(Average_Score_SAT_Math, na.rm=TRUE),
              median = median(Average_Score_SAT_Math, na.rm=TRUE))
## # A tibble: 24 x 5
## # Groups:   Borough [5]
##    Borough  Teacher_Education_Level  mean   var median
##    <chr>    <chr>                   <dbl> <dbl>  <dbl>
##  1 Bronx    BA                       428. 7446.   407 
##  2 Bronx    College Student          384.  493.   382 
##  3 Bronx    Grad Student             400. 1776.   397 
##  4 Bronx    MA                       401.  901.   395 
##  5 Bronx    PhD                      400. 2468.   386 
##  6 Brooklyn BA                       425. 4005.   405 
##  7 Brooklyn College Student          396. 3035.   382 
##  8 Brooklyn Grad Student             436. 5756.   408 
##  9 Brooklyn MA                       414. 2401.   395 
## 10 Brooklyn PhD                      402  1868.   394.
## # ... with 14 more rows
# If we want to use SAT scores as our outcome, we need to examine their missingness
# First, look at the pattern of missingness using md.pattern() from the mice package
# There are 60 scores missing in each of the scores
# There are many R packages which help with more advanced forms of imputation, such as MICE, Amelia, mi, and more
# We will use simputation and its impute_median() as we did previously

#examine missingness with md.pattern()
mice::md.pattern(nyc_scores)

##     School_ID School_Name Borough Building_Code Street_Address City State
## 374         1           1       1             1              1    1     1
## 11          1           1       1             1              1    1     1
## 42          1           1       1             1              1    1     1
## 4           1           1       1             1              1    1     1
## 1           1           1       1             1              1    1     1
## 3           1           1       1             1              1    1     1
##             0           0       0             0              0    0     0
##     Zip_Code Latitude Longitude Phone_Number Teacher_Education_Level
## 374        1        1         1            1                       1
## 11         1        1         1            1                       1
## 42         1        1         1            1                       1
## 4          1        1         1            1                       1
## 1          1        1         1            1                       1
## 3          1        1         1            1                       1
##            0        0         0            0                       0
##     Start_Time End_Time Student_Enrollment Percent_White Percent_Black
## 374          1        1                  1             1             1
## 11           1        1                  1             1             1
## 42           1        1                  1             1             1
## 4            1        1                  0             0             0
## 1            0        0                  1             1             1
## 3            0        0                  0             0             0
##              4        4                  7             7             7
##     Percent_Hispanic Percent_Asian Percent_Tested Average_Score_SAT_Math
## 374                1             1              1                      1
## 11                 1             1              1                      0
## 42                 1             1              0                      0
## 4                  0             0              0                      0
## 1                  1             1              1                      1
## 3                  0             0              0                      0
##                    7             7             49                     60
##     Average_Score_SAT_Reading Average_Score_SAT_Writing    
## 374                         1                         1   0
## 11                          0                         0   3
## 42                          0                         0   4
## 4                           0                         0   9
## 1                           1                         1   2
## 3                           0                         0  11
##                            60                        60 272
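
# As a simpler cross-check of the md.pattern() counts, the per-column totals of missing values can be tallied directly with base R (a quick sketch using the nyc_scores columns shown above):

#count missing values per column; totals should match the bottom row of md.pattern()
colSums(is.na(nyc_scores))

#confirm the 60 missing Math scores noted above
sum(is.na(nyc_scores$Average_Score_SAT_Math))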
#impute the Math score by Borough (the Reading and Writing scores could be imputed the same way)
nyc_scores_2 <- simputation::impute_median(nyc_scores, Average_Score_SAT_Math ~ Borough)

#convert Math score to numeric
nyc_scores_2$Average_Score_SAT_Math <- as.numeric(nyc_scores_2$Average_Score_SAT_Math)

#examine scores by Borough in both datasets, before and after imputation
nyc_scores %>% 
  group_by(Borough) %>% 
  summarise(median = median(Average_Score_SAT_Math, na.rm = TRUE), mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 x 3
##   Borough       median  mean
##   <chr>          <dbl> <dbl>
## 1 Bronx           396.  404.
## 2 Brooklyn        395   416.
## 3 Manhattan       433   456.
## 4 Queens          448   462.
## 5 Staten Island   466.  486.
nyc_scores_2 %>% 
  group_by(Borough) %>% 
  summarise(median = median(Average_Score_SAT_Math, na.rm = TRUE), mean = mean(Average_Score_SAT_Math, na.rm = TRUE))
## # A tibble: 5 x 3
##   Borough       median  mean
##   <chr>          <dbl> <dbl>
## 1 Bronx           396.  403.
## 2 Brooklyn        395   414.
## 3 Manhattan       433   452.
## 4 Queens          448   460.
## 5 Staten Island   466.  486.
#design a Latin Square with 5 treatments A:E, then look at the sketch
my_design_lsd <- agricolae::design.lsd(LETTERS[1:5], serie=0, seed=42)
my_design_lsd$sketch
##      [,1] [,2] [,3] [,4] [,5]
## [1,] "B"  "E"  "D"  "A"  "C" 
## [2,] "A"  "D"  "C"  "E"  "B" 
## [3,] "E"  "C"  "B"  "D"  "A" 
## [4,] "C"  "A"  "E"  "B"  "D" 
## [5,] "D"  "B"  "A"  "C"  "E"
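
# agricolae also returns the design as a long-format data frame in the $book element, which is often handier than the sketch matrix for joining treatments onto data (a sketch; exact column names may vary by agricolae version)

#one row per plot, with its row, column, and assigned treatment
my_design_lsd$book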
# To execute a Latin Square design on this data, suppose we want to know the effect of our tutoring program, which includes one-on-one tutoring, small groups of two sizes, and SAT prep classes held during and after school
# A new dataset, nyc_scores_ls, represents this experiment

# We'll block by Borough and Teacher_Education_Level to reduce their known variance on the score outcome
# Borough is a good blocking factor because American schools are funded partly by local taxes, which likely affects the quality of education

lsID <- c('11X290', '10X342', '09X260', '09X412', '12X479', '14K478', '32K554', '14K685', '22K405', '17K382', '05M692', '02M427', '02M308', '03M402', '02M282', '30Q501', '26Q495', '24Q455', '29Q326', '25Q670', '31R450', '31R445', '31R080', '31R460', '31R455')
lsTP <- c('One-on-One', 'Small Groups (2-3)', 'Small Groups (4-6)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'Small Groups (2-3)', 'Small Groups (4-6)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (4-6)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (2-3)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (2-3)', 'Small Groups (4-6)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (2-3)', 'Small Groups (4-6)', 'SAT Prep Class (school hours)')

nyc_scores_ls <- nyc_scores_2 %>%
    filter(School_ID %in% lsID) %>%
    mutate(Tutoring_Program=lsTP)


#build nyc_scores_ls_lm
nyc_scores_ls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level,
                       data=nyc_scores_ls
                       )

#tidy the results with broom
nyc_scores_ls_lm %>% broom::tidy()
## # A tibble: 13 x 5
##    term                                estimate std.error statistic p.value
##    <chr>                                  <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                           398.        70.6    5.64   1.09e-4
##  2 Tutoring_ProgramSAT Prep Class (af~    -3.23      49.7   -0.0650 9.49e-1
##  3 Tutoring_ProgramSAT Prep Class (sc~    17.7       63.3    0.279  7.85e-1
##  4 Tutoring_ProgramSmall Groups (2-3)    -46.0       58.5   -0.787  4.47e-1
##  5 Tutoring_ProgramSmall Groups (4-6)    -48.4       49.7   -0.973  3.50e-1
##  6 BoroughBrooklyn                        59.9       59.5    1.01   3.34e-1
##  7 BoroughManhattan                       55.5       53.0    1.05   3.16e-1
##  8 BoroughQueens                          84.5       59.1    1.43   1.78e-1
##  9 BoroughStaten Island                   54.0       49.2    1.10   2.93e-1
## 10 Teacher_Education_LevelCollege Stu~    16.3       81.0    0.201  8.44e-1
## 11 Teacher_Education_LevelGrad Student    71.9       64.3    1.12   2.86e-1
## 12 Teacher_Education_LevelMA               8.12      46.5    0.175  8.64e-1
## 13 Teacher_Education_LevelPhD            -68.1      102.    -0.671  5.15e-1
#examine the results with anova
nyc_scores_ls_lm %>% anova()
## Analysis of Variance Table
## 
## Response: Average_Score_SAT_Math
##                         Df Sum Sq Mean Sq F value Pr(>F)
## Tutoring_Program         4  42101 10525.3  1.8387 0.1863
## Borough                  4  13665  3416.3  0.5968 0.6719
## Teacher_Education_Level  4  15046  3761.6  0.6571 0.6332
## Residuals               12  68693  5724.4
#create a boxplot of Math scores by Borough, with a title and x/y axis labels
ggplot(nyc_scores, aes(x=Borough, y=Average_Score_SAT_Math)) + 
  geom_boxplot() + 
  ggtitle("Average SAT Math Scores by Borough, NYC") + 
  xlab("Borough (NYC)") + 
  ylab("Average SAT Math Scores (2014-15)")
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

#create trt1 and trt2
trt1 <- LETTERS[1:5]
trt2 <- 1:5

#create my_graeco_design
my_graeco_design <- agricolae::design.graeco(trt1, trt2, serie=0, seed=42)

#examine the parameters and sketch
my_graeco_design$parameters
## $design
## [1] "graeco"
## 
## $trt1
## [1] "A" "B" "C" "D" "E"
## 
## $trt2
## [1] 1 2 3 4 5
## 
## $r
## [1] 5
## 
## $serie
## [1] 0
## 
## $seed
## [1] 42
## 
## $kinds
## [1] "Super-Duper"
## 
## [[8]]
## [1] TRUE
my_graeco_design$sketch
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "D 2" "E 3" "A 1" "C 5" "B 4"
## [2,] "E 1" "A 5" "C 4" "B 2" "D 3"
## [3,] "A 4" "C 2" "B 3" "D 1" "E 5"
## [4,] "C 3" "B 1" "D 5" "E 4" "A 2"
## [5,] "B 5" "D 4" "E 2" "A 3" "C 1"
glsID <- c('09X241', '10X565', '09X260', '07X259', '11X455', '18K563', '23K697', '32K403', '22K425', '16K688', '02M135', '06M348', '02M419', '02M489', '04M495', '30Q502', '24Q530', '30Q555', '24Q560', '27Q650', '31R440', '31R064', '31R450', '31R445', '31R460')
glsTP <- c('SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (4-6)', 'Small Groups (2-3)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (4-6)', 'Small Groups (2-3)', 'SAT Prep Class (school hours)', 'One-on-One', 'Small Groups (4-6)', 'Small Groups (2-3)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'Small Groups (4-6)', 'Small Groups (2-3)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (2-3)', 'SAT Prep Class (school hours)', 'SAT Prep Class (after school)', 'One-on-One', 'Small Groups (4-6)')
glsHT <- c('Small Group', 'Large Group', 'Individual', 'Mix of Large Group/Individual', 'Mix of Small Group/Individual', 'Individual', 'Mix of Large Group/Individual', 'Mix of Small Group/Individual', 'Small Group', 'Large Group', 'Mix of Small Group/Individual', 'Small Group', 'Large Group', 'Individual', 'Mix of Large Group/Individual', 'Large Group', 'Individual', 'Mix of Large Group/Individual', 'Mix of Small Group/Individual', 'Small Group', 'Mix of Large Group/Individual', 'Mix of Small Group/Individual', 'Small Group', 'Large Group', 'Individual')


nyc_scores_gls <- nyc_scores_2 %>%
    filter(School_ID %in% glsID) %>%
    mutate(Tutoring_Program=glsTP, Homework_Type=glsHT)


#build nyc_scores_gls_lm
nyc_scores_gls_lm <- lm(Average_Score_SAT_Math ~ Tutoring_Program + Borough + Teacher_Education_Level + Homework_Type, data=nyc_scores_gls)

#tidy the results with broom
nyc_scores_gls_lm %>% broom::tidy()
## # A tibble: 17 x 5
##    term                              estimate std.error statistic   p.value
##    <chr>                                <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                         376.        32.1   11.7      2.61e-6
##  2 Tutoring_ProgramSAT Prep Class (~    43.4       27.2    1.60     1.49e-1
##  3 Tutoring_ProgramSAT Prep Class (~    28.2       29.3    0.960    3.65e-1
##  4 Tutoring_ProgramSmall Groups (2-~    32.7       27.3    1.20     2.66e-1
##  5 Tutoring_ProgramSmall Groups (4-~    56.1       27.5    2.04     7.51e-2
##  6 BoroughBrooklyn                     -21.5       28.9   -0.744    4.78e-1
##  7 BoroughManhattan                     13.5       28.0    0.484    6.42e-1
##  8 BoroughQueens                        41.8       24.8    1.68     1.31e-1
##  9 BoroughStaten Island                 25.1       22.9    1.10     3.05e-1
## 10 Teacher_Education_LevelCollege S~     2.61      30.6    0.0853   9.34e-1
## 11 Teacher_Education_LevelGrad Stud~    40.2       40.1    1.00     3.46e-1
## 12 Teacher_Education_LevelMA            18.0       22.9    0.786    4.55e-1
## 13 Teacher_Education_LevelPhD           14.2       36.3    0.392    7.05e-1
## 14 Homework_TypeLarge Group             -1.07      25.8   -0.0415   9.68e-1
## 15 Homework_TypeMix of Large Group/~   -15.6       26.4   -0.592    5.70e-1
## 16 Homework_TypeMix of Small Group/~     3.28      25.0    0.131    8.99e-1
## 17 Homework_TypeSmall Group             47.8       28.3    1.69     1.29e-1
#examine the results with anova
nyc_scores_gls_lm %>% anova()
## Analysis of Variance Table
## 
## Response: Average_Score_SAT_Math
##                         Df  Sum Sq Mean Sq F value  Pr(>F)  
## Tutoring_Program         4 15371.5  3842.9  3.1570 0.07801 .
## Borough                  4  5277.0  1319.3  1.0838 0.42563  
## Teacher_Education_Level  4  2869.7   717.4  0.5894 0.67993  
## Homework_Type            4  9738.7  2434.7  2.0002 0.18747  
## Residuals                8  9737.9  1217.2                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pctTHL <- c(1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2)
pctBHL <- c(2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)
tP <- c('Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 
'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No')

nyc_scores <- nyc_scores %>%
    select(-Teacher_Education_Level) %>%
    mutate(Percent_Tested_HL=factor(pctTHL), Percent_Black_HL=factor(pctBHL), Tutoring_Program=factor(tP))


#build the boxplots for all 3 factor variables: tutoring program, pct black, pct tested
ggplot(nyc_scores, aes(x=Tutoring_Program, y=Average_Score_SAT_Math)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

ggplot(nyc_scores, aes(x=Percent_Black_HL, y=Average_Score_SAT_Math)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

ggplot(nyc_scores, aes(x=Percent_Tested_HL, y=Average_Score_SAT_Math)) + 
    geom_boxplot()
## Warning: Removed 60 rows containing non-finite values (stat_boxplot).

#create nyc_scores_factorial and examine the results
nyc_scores_factorial <- aov(Average_Score_SAT_Math ~ Percent_Tested_HL * Percent_Black_HL * Tutoring_Program, data=nyc_scores)
broom::tidy(nyc_scores_factorial)
## # A tibble: 8 x 6
##   term                                df   sumsq meansq statistic   p.value
##   <chr>                            <dbl>   <dbl>  <dbl>     <dbl>     <dbl>
## 1 Percent_Tested_HL                    1  3.92e5 3.92e5   116.     1.21e-23
## 2 Percent_Black_HL                     1  1.87e5 1.87e5    55.3    7.52e-13
## 3 Tutoring_Program                     1  9.42e3 9.42e3     2.78   9.63e- 2
## 4 Percent_Tested_HL:Percent_Black~     1  8.88e4 8.88e4    26.2    4.94e- 7
## 5 Percent_Tested_HL:Tutoring_Prog~     1  1.98e3 1.98e3     0.584  4.45e- 1
## 6 Percent_Black_HL:Tutoring_Progr~     1  5.16e3 5.16e3     1.52   2.18e- 1
## 7 Percent_Tested_HL:Percent_Black~     1  7.77e3 7.77e3     2.29   1.31e- 1
## 8 Residuals                          367  1.24e6 3.39e3    NA     NA
#use shapiro.test() to test the outcome
shapiro.test(nyc_scores$Average_Score_SAT_Math)
## 
##  Shapiro-Wilk normality test
## 
## data:  nyc_scores$Average_Score_SAT_Math
## W = 0.84672, p-value < 2.2e-16
#plot nyc_scores_factorial to examine residuals
par(mfrow = c(2, 2))
plot(nyc_scores_factorial)

par(mfrow = c(1, 1))
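
# The Shapiro-Wilk test above was run on the raw outcome; a complementary check on the model residuals themselves may be worthwhile (a minimal sketch using base R)

#test normality of the factorial model's residuals rather than the raw scores
shapiro.test(residuals(nyc_scores_factorial))

#a QQ plot of the residuals shows the same information visually
qqnorm(residuals(nyc_scores_factorial))
qqline(residuals(nyc_scores_factorial))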

Structural Equation Modeling with lavaan in R

Chapter 1 - One-Factor Models

Model Specification - Structural Equation Models (SEM) - explore relationships between variables:

  • Can also be used to confirm the structure of a previously developed model
  • Two variable types - manifest (directly measured) which are represented by squares, and latent (abstract underlying phenomenon) represented as circles
    • The manifest variables are assumed to be driven by the latent variables (such as intelligence)
  • Can set up an analysis in R using lavaan based on 1939 intelligence data
    • library(lavaan)
    • data(HolzingerSwineford1939)
    • example.model <- 'latent_variable =~ manifest_variable1 + manifest_variable2 + …' # latent_variable can be any name not already in the dataset; =~ indicates the direction of prediction
    • visual.model <- 'visual =~ x1 + x2 + x3 + x7 + x8 + x9' # x1, x2, x3, x7, x8, x9 are visual components inside the 1939 dataset

Model Analysis:

  • Degrees of freedom are based on df = Possible Values - Estimated Values
    • Possible Values = Manifest Variables * (Manifest Variables + 1) / 2
    • Models need to have at least 3 manifest variables and df > 0
    • Can use scaling and constraints to control degrees of freedom - managed inside lavaan but can modify defaults
  • Can run the models using lavaan in R
    • visual.model <- ‘visual =~ x1 + x2 + x3 + x7 + x8 + x9’
    • visual.fit <- cfa(model = visual.model, data = HolzingerSwineford1939) # include the previously defined model and the data frame
    • summary(visual.fit) # basic information about the model
    • The loadings (weightings) for each of the manifest variables will be shown, typically with the first coefficient set to 1 as per the scaling
    • The variance estimates are also provided for each of the variables - these should be positive; a negative variance estimate signals a problem that needs troubleshooting
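
The degrees-of-freedom arithmetic can be checked by hand. For a one-factor model with 6 manifest variables there are 6 * 7 / 2 = 21 possible values, and the cfa() fit later in this section reports 12 free parameters, leaving df = 9:

#possible values for m = 6 manifest variables: m * (m + 1) / 2
m <- 6
possible <- m * (m + 1) / 2  #21

#a one-factor model with 6 indicators estimates 5 loadings (the first is fixed to 1),
#6 residual variances, and 1 latent variance, for 12 free parameters
estimated <- 12
possible - estimated  #df = 9, matching the summary() output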

Model Assessment:

  • Standardized loadings measure the strength of the relationships between the manifest variables and the latent variables
    • Can be measured based on the estimates, relative to the variable that was set as 1.00
    • Can instead use the standardized solution based on the z-scores
    • summary(visual.fit, standardized = TRUE) # to get the standardized solution (Std.all column, with close to 1 being best; Std.lv being scaled like the loading variable)
  • The model fit measures how well the data fit the specified model
    • Goodness-of-fit indices like the Comparative Fit Index (CFI) or the Tucker-Lewis Index (TLI) - goal is close to 1, with 0.9 or above generally considered acceptable
    • Badness-of-fit indices like the Root Mean Square Error of Approximation (RMSEA) or the Standardized Root Mean Square Residual (SRMR) - goal is low, generally below 0.1
    • summary(visual.fit, standardized = TRUE, fit.measures = TRUE) # will show the most common fit measures

Example code includes:

#Load the lavaan library
library(lavaan)
## This is lavaan 0.6-3
## lavaan is BETA software! Please report any bugs.
#Look at the dataset
data(HolzingerSwineford1939, package="lavaan")
head(HolzingerSwineford1939[ , 7:15])
##         x1   x2    x3       x4   x5        x6       x7   x8       x9
## 1 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
## 2 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
## 3 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
## 4 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
## 5 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
## 6 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000
#Define your model specification
text.model <- "textspeed =~ x4 + x5 + x6 + x7 + x8 + x9"

#Analyze the model with cfa()
text.fit <- lavaan::cfa(model=text.model, data=HolzingerSwineford1939)

#Summarize the model
summary(text.fit)
## lavaan 0.6-3 ended normally after 20 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     149.786
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   textspeed =~                                        
##     x4                1.000                           
##     x5                1.130    0.067   16.946    0.000
##     x6                0.925    0.056   16.424    0.000
##     x7                0.196    0.067    2.918    0.004
##     x8                0.186    0.062    2.984    0.003
##     x9                0.279    0.062    4.539    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x4                0.383    0.048    7.903    0.000
##    .x5                0.424    0.059    7.251    0.000
##    .x6                0.368    0.044    8.419    0.000
##    .x7                1.146    0.094   12.217    0.000
##    .x8                0.988    0.081   12.215    0.000
##    .x9                0.940    0.077   12.142    0.000
##     textspeed         0.968    0.112    8.647    0.000
summary(text.fit, standardized=TRUE)
## lavaan 0.6-3 ended normally after 20 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     149.786
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   textspeed =~                                                          
##     x4                1.000                               0.984    0.846
##     x5                1.130    0.067   16.946    0.000    1.112    0.863
##     x6                0.925    0.056   16.424    0.000    0.910    0.832
##     x7                0.196    0.067    2.918    0.004    0.193    0.177
##     x8                0.186    0.062    2.984    0.003    0.183    0.181
##     x9                0.279    0.062    4.539    0.000    0.275    0.273
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.383    0.048    7.903    0.000    0.383    0.284
##    .x5                0.424    0.059    7.251    0.000    0.424    0.256
##    .x6                0.368    0.044    8.419    0.000    0.368    0.308
##    .x7                1.146    0.094   12.217    0.000    1.146    0.969
##    .x8                0.988    0.081   12.215    0.000    0.988    0.967
##    .x9                0.940    0.077   12.142    0.000    0.940    0.926
##     textspeed         0.968    0.112    8.647    0.000    1.000    1.000
summary(text.fit, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 20 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     149.786
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              681.336
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.789
##   Tucker-Lewis Index (TLI)                       0.648
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2476.130
##   Loglikelihood unrestricted model (H1)      -2401.237
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                4976.261
##   Bayesian (BIC)                              5020.746
##   Sample-size adjusted Bayesian (BIC)         4982.689
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.228
##   90 Percent Confidence Interval          0.197  0.261
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.148
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   textspeed =~                                        
##     x4                1.000                           
##     x5                1.130    0.067   16.946    0.000
##     x6                0.925    0.056   16.424    0.000
##     x7                0.196    0.067    2.918    0.004
##     x8                0.186    0.062    2.984    0.003
##     x9                0.279    0.062    4.539    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .x4                0.383    0.048    7.903    0.000
##    .x5                0.424    0.059    7.251    0.000
##    .x6                0.368    0.044    8.419    0.000
##    .x7                1.146    0.094   12.217    0.000
##    .x8                0.988    0.081   12.215    0.000
##    .x9                0.940    0.077   12.142    0.000
##     textspeed         0.968    0.112    8.647    0.000
#Look at the dataset
data(PoliticalDemocracy, package="lavaan")
head(PoliticalDemocracy)
##      y1       y2       y3       y4       y5       y6       y7       y8
## 1  2.50 0.000000 3.333333 0.000000 1.250000 0.000000 3.726360 3.333333
## 2  1.25 0.000000 3.333333 0.000000 6.250000 1.100000 6.666666 0.736999
## 3  7.50 8.800000 9.999998 9.199991 8.750000 8.094061 9.999998 8.211809
## 4  8.90 8.800000 9.999998 9.199991 8.907948 8.127979 9.999998 4.615086
## 5 10.00 3.333333 9.999998 6.666666 7.500000 3.333333 9.999998 6.666666
## 6  7.50 3.333333 6.666666 6.666666 6.250000 1.100000 6.666666 0.368500
##         x1       x2       x3
## 1 4.442651 3.637586 2.557615
## 2 5.384495 5.062595 3.568079
## 3 5.961005 6.255750 5.224433
## 4 6.285998 7.567863 6.267495
## 5 5.863631 6.818924 4.573679
## 6 5.533389 5.135798 3.892270
#Define your model specification
politics.model <- "poldemo60 =~ y1 + y2 + y3 + y4"

#Analyze the model with cfa()
politics.fit <- lavaan::cfa(model = politics.model, data = PoliticalDemocracy)

#Summarize the model
summary(politics.fit, standardized=TRUE, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 26 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          8
## 
##   Number of observations                            75
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      10.006
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.007
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              159.183
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.948
##   Tucker-Lewis Index (TLI)                       0.843
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -704.138
##   Loglikelihood unrestricted model (H1)       -699.135
## 
##   Number of free parameters                          8
##   Akaike (AIC)                                1424.275
##   Bayesian (BIC)                              1442.815
##   Sample-size adjusted Bayesian (BIC)         1417.601
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.231
##   90 Percent Confidence Interval          0.103  0.382
##   P-value RMSEA <= 0.05                          0.014
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.046
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   poldemo60 =~                                                          
##     y1                1.000                               2.133    0.819
##     y2                1.404    0.197    7.119    0.000    2.993    0.763
##     y3                1.089    0.167    6.529    0.000    2.322    0.712
##     y4                1.370    0.167    8.228    0.000    2.922    0.878
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .y1                2.239    0.512    4.371    0.000    2.239    0.330
##    .y2                6.412    1.293    4.960    0.000    6.412    0.417
##    .y3                5.229    0.990    5.281    0.000    5.229    0.492
##    .y4                2.530    0.765    3.306    0.001    2.530    0.229
##     poldemo60         4.548    1.106    4.112    0.000    1.000    1.000

Chapter 2 - Multi-Factor Models

Multifactor Specification - modeling multiple latent variables and the relationships among them:

  • Combining manifest variables that represent different latent variables into a single factor often produces a poorly fitting model
  • Can instead assign each set of manifest variables to its own latent variable, for example
    • visual.model <- 'visual =~ x1 + x2 + x3'
    • visual.fit <- cfa(model = visual.model, data = HolzingerSwineford1939)
    • speed.model <- 'speed =~ x7 + x8 + x9'
    • speed.fit <- cfa(model = speed.model, data = HolzingerSwineford1939)
  • However, small models (a single factor with only three manifest variables) have zero degrees of freedom; equality constraints (such as forcing x2 and x3 to share the same loading) free up degrees of freedom
    • visual.model <- 'visual =~ x1 + a*x2 + a*x3' # The shared label a constrains the x2 and x3 loadings to be equal; using a number instead of a label would fix the loading to that exact value
  • One larger model can sometimes better capture all the relationships
    • twofactor.model <- 'visual =~ x1 + x2 + x3
      speed =~ x7 + x8 + x9' # adding both factors in the same specification (each latent variable must start on a new line inside the quoted string)
    • twofactor.fit <- cfa(model = twofactor.model, data = HolzingerSwineford1939)
    • summary(twofactor.fit, standardized = TRUE, fit.measures = TRUE)

Model Structure:

  • The two-factor model assumes a covariance relationship between the latent variables - by default the two factors are allowed to correlate, not to predict one another
  • The Std.all column of the summary() output shows each covariance as a correlation between the standardized variables (squaring it gives the shared variance, an R-squared)
    • =~ creates latent variables
    • ~~ creates covariances between variables
    • ~ creates direct (regression) prediction between variables
    • a new line containing 'speed ~~ 0*visual' forces the covariance to zero, so speed is assumed NOT to vary at all with visual
    • a new line containing 'speed ~ visual' instead assumes a direct predictive relationship between these variables
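The three operators can be combined in one model string. A minimal sketch using the HolzingerSwineford1939 variables from this chapter (the `speed ~ visual` regression is illustrative, replacing the default covariance between the two factors):

```r
library(lavaan)

# Two latent variables (=~), with speed regressed on visual (~)
op.model <- 'visual =~ x1 + x2 + x3
             speed  =~ x7 + x8 + x9
             speed   ~ visual'

op.fit <- cfa(model = op.model, data = HolzingerSwineford1939)

# The regression appears in its own "Regressions:" section of the output
summary(op.fit, standardized = TRUE)
```

Swapping the last model line for `speed ~~ 0*visual` would instead fix the factor covariance to zero.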

Modification Indices:

  • If a model fits poorly, examine the standardized solution - standardized loadings should ideally exceed about 0.3
  • Model problems can often be spotted as estimated variances that are very high relative to the variances in the raw data
  • Modification indices show the expected improvement in model fit if an additional parameter were added
    • modificationindices(twofactor.fit, sort = TRUE)
    • Output columns are lhs, op, rhs (left-hand side, operator, right-hand side) followed by mi (the modification index, the expected drop in the model chi-square)
    • Parameters should be added one at a time, and only if they “make theoretical sense”
    • Take the desired path(s) and add them as new lines in the model string
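The workflow above can be sketched as follows, assuming lavaan is loaded; the `text =~ x9` path added at the end is purely illustrative - in practice it would be chosen from the top of the sorted output and only if it made theoretical sense:

```r
library(lavaan)

twofactor.model <- 'text  =~ x4 + x5 + x6
                    speed =~ x7 + x8 + x9'
twofactor.fit <- cfa(model = twofactor.model, data = HolzingerSwineford1939)

# Largest modification indices first; each mi is the expected drop
# in the model chi-square if that parameter were freed
modificationindices(twofactor.fit, sort = TRUE)

# Hypothetical revision: free one cross-loading and refit
twofactor.fit1 <- cfa(model = 'text  =~ x4 + x5 + x6 + x9
                               speed =~ x7 + x8 + x9',
                      data = HolzingerSwineford1939)
```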

Model Comparison:

  • Can create and save two models, fit both with cfa() on the same data, then use anova() to compare them
    • anova(twofactor.fit, twofactor.fit1)
    • This chi-square difference test is only valid for nested models that otherwise share the same variables
  • Can also compare more detailed fit indices
    • fitmeasures(twofactor.fit)
    • AIC: lower is better (more negative beats less negative)
    • ECVI (Expected Cross-Validation Index): estimates how well the model would replicate in a new sample of the same size from the same population (lower is better)
    • fitmeasures(twofactor.fit1, c("aic", "ecvi"))
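A self-contained sketch of the comparison, assuming lavaan is loaded; the extra x9 loading in the second model is illustrative, included only so the pair is nested:

```r
library(lavaan)

twofactor.model  <- 'text  =~ x4 + x5 + x6
                     speed =~ x7 + x8 + x9'
# Nested variant with one extra (illustrative) cross-loading
twofactor.model1 <- 'text  =~ x4 + x5 + x6 + x9
                     speed =~ x7 + x8 + x9'

twofactor.fit  <- cfa(model = twofactor.model,  data = HolzingerSwineford1939)
twofactor.fit1 <- cfa(model = twofactor.model1, data = HolzingerSwineford1939)

# Chi-square difference test for the nested pair
anova(twofactor.fit, twofactor.fit1)

# Compare specific fit indices; lower AIC/ECVI is better
fitmeasures(twofactor.fit,  c("aic", "ecvi"))
fitmeasures(twofactor.fit1, c("aic", "ecvi"))
```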

Example code includes:

#Create your text model specification
text.model <- 'text =~ x4 + x5 + x6'

#Analyze the model
text.fit <- cfa(model=text.model, data=HolzingerSwineford1939)

#Summarize the model
summary(text.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 15 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              497.430
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1181.065
##   Loglikelihood unrestricted model (H1)      -1181.065
## 
##   Number of free parameters                          6
##   Akaike (AIC)                                2374.130
##   Bayesian (BIC)                              2396.372
##   Sample-size adjusted Bayesian (BIC)         2377.344
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.000
##   P-value RMSEA <= 0.05                             NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   text =~                                                               
##     x4                1.000                               0.984    0.847
##     x5                1.133    0.067   16.906    0.000    1.115    0.866
##     x6                0.924    0.056   16.391    0.000    0.910    0.832
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.382    0.049    7.805    0.000    0.382    0.283
##    .x5                0.416    0.059    7.038    0.000    0.416    0.251
##    .x6                0.369    0.044    8.367    0.000    0.369    0.308
##     text              0.969    0.112    8.640    0.000    1.000    1.000
#Update the model specification by setting two paths to the label a
text.model <- 'text =~ x4 + a*x5 + a*x6'

#Analyze the model
text.fit <- cfa(model = text.model, data = HolzingerSwineford1939)

#Summarize the model
summary(text.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 14 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
##   Number of equality constraints                     1
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      11.227
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.001
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              497.430
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.979
##   Tucker-Lewis Index (TLI)                       0.938
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1186.678
##   Loglikelihood unrestricted model (H1)      -1181.065
## 
##   Number of free parameters                          5
##   Akaike (AIC)                                2383.357
##   Bayesian (BIC)                              2401.892
##   Sample-size adjusted Bayesian (BIC)         2386.035
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.184
##   90 Percent Confidence Interval          0.098  0.288
##   P-value RMSEA <= 0.05                          0.007
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.073
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   text =~                                                               
##     x4                1.000                               0.983    0.846
##     x5         (a)    1.009    0.054   18.747    0.000    0.992    0.815
##     x6         (a)    1.009    0.054   18.747    0.000    0.992    0.866
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.383    0.050    7.631    0.000    0.383    0.284
##    .x5                0.499    0.054    9.164    0.000    0.499    0.336
##    .x6                0.328    0.045    7.285    0.000    0.328    0.250
##     text              0.967    0.113    8.585    0.000    1.000    1.000
#Create a two-factor model of text and speed variables
twofactor.model <- 'text =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'

#Previous one-factor model output
summary(text.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 14 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          6
##   Number of equality constraints                     1
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      11.227
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.001
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              497.430
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.979
##   Tucker-Lewis Index (TLI)                       0.938
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1186.678
##   Loglikelihood unrestricted model (H1)      -1181.065
## 
##   Number of free parameters                          5
##   Akaike (AIC)                                2383.357
##   Bayesian (BIC)                              2401.892
##   Sample-size adjusted Bayesian (BIC)         2386.035
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.184
##   90 Percent Confidence Interval          0.098  0.288
##   P-value RMSEA <= 0.05                          0.007
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.073
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   text =~                                                               
##     x4                1.000                               0.983    0.846
##     x5         (a)    1.009    0.054   18.747    0.000    0.992    0.815
##     x6         (a)    1.009    0.054   18.747    0.000    0.992    0.866
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.383    0.050    7.631    0.000    0.383    0.284
##    .x5                0.499    0.054    9.164    0.000    0.499    0.336
##    .x6                0.328    0.045    7.285    0.000    0.328    0.250
##     text              0.967    0.113    8.585    0.000    1.000    1.000
#Two-factor model specification
twofactor.model <- 'text =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'

#Use cfa() to analyze the model
twofactor.fit <- cfa(model=twofactor.model, data=HolzingerSwineford1939)

#Use summary() to view the fitted model
summary(twofactor.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 24 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         13
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      14.354
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.073
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              681.336
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.990
##   Tucker-Lewis Index (TLI)                       0.982
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2408.414
##   Loglikelihood unrestricted model (H1)      -2401.237
## 
##   Number of free parameters                         13
##   Akaike (AIC)                                4842.828
##   Bayesian (BIC)                              4891.021
##   Sample-size adjusted Bayesian (BIC)         4849.792
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.051
##   90 Percent Confidence Interval          0.000  0.093
##   P-value RMSEA <= 0.05                          0.425
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.039
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   text =~                                                               
##     x4                1.000                               0.984    0.847
##     x5                1.132    0.067   16.954    0.000    1.114    0.865
##     x6                0.925    0.056   16.438    0.000    0.911    0.833
##   speed =~                                                              
##     x7                1.000                               0.674    0.619
##     x8                1.150    0.165    6.990    0.000    0.775    0.766
##     x9                0.878    0.123    7.166    0.000    0.592    0.587
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   text ~~                                                               
##     speed             0.173    0.052    3.331    0.001    0.261    0.261
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.382    0.049    7.854    0.000    0.382    0.283
##    .x5                0.418    0.059    7.113    0.000    0.418    0.252
##    .x6                0.367    0.044    8.374    0.000    0.367    0.307
##    .x7                0.729    0.084    8.731    0.000    0.729    0.616
##    .x8                0.422    0.084    5.039    0.000    0.422    0.413
##    .x9                0.665    0.071    9.383    0.000    0.665    0.655
##     text              0.969    0.112    8.647    0.000    1.000    1.000
##     speed             0.454    0.096    4.728    0.000    1.000    1.000
#Load the epi data from the psych package
data(epi, package="psych")

#Specify a three-factor model with one correlation set to zero
epi.model <- 'extraversion =~ V1 + V3 + V5 + V8
neuroticism =~ V2 + V4 + V7 + V9
lying =~ V6 + V12 + V18 + V24
extraversion ~~ 0*neuroticism'

#Run the model
epi.fit <- cfa(model = epi.model, data = epi)

#Examine the output 
summary(epi.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 118 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         26
## 
##                                                   Used       Total
##   Number of observations                          3193        3570
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     584.718
##   Degrees of freedom                                52
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             2196.019
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.750
##   Tucker-Lewis Index (TLI)                       0.683
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23208.145
##   Loglikelihood unrestricted model (H1)     -22915.787
## 
##   Number of free parameters                         26
##   Akaike (AIC)                               46468.291
##   Bayesian (BIC)                             46626.077
##   Sample-size adjusted Bayesian (BIC)        46543.464
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.057
##   90 Percent Confidence Interval          0.053  0.061
##   P-value RMSEA <= 0.05                          0.004
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.058
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion =~                                                       
##     V1                1.000                               0.052    0.115
##     V3                1.360    0.329    4.127    0.000    0.070    0.141
##     V5               -2.829    0.554   -5.109    0.000   -0.146   -0.391
##     V8                7.315    1.832    3.992    0.000    0.377    0.797
##   neuroticism =~                                                        
##     V2                1.000                               0.228    0.457
##     V4                0.424    0.053    8.004    0.000    0.097    0.196
##     V7                1.395    0.093   15.023    0.000    0.318    0.648
##     V9                1.205    0.078   15.506    0.000    0.275    0.553
##   lying =~                                                              
##     V6                1.000                               0.135    0.272
##     V12              -0.851    0.132   -6.435    0.000   -0.115   -0.291
##     V18              -0.785    0.122   -6.421    0.000   -0.106   -0.289
##     V24               1.086    0.161    6.734    0.000    0.147    0.339
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion ~~                                                       
##     neuroticism       0.000                               0.000    0.000
##     lying            -0.002    0.001   -3.313    0.001   -0.258   -0.258
##   neuroticism ~~                                                        
##     lying            -0.014    0.002   -6.867    0.000   -0.469   -0.469
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .V1                0.198    0.005   39.567    0.000    0.198    0.987
##    .V3                0.243    0.006   39.278    0.000    0.243    0.980
##    .V5                0.118    0.005   23.900    0.000    0.118    0.847
##    .V8                0.082    0.026    3.084    0.002    0.082    0.364
##    .V2                0.197    0.006   32.516    0.000    0.197    0.791
##    .V4                0.235    0.006   38.906    0.000    0.235    0.962
##    .V7                0.140    0.007   19.412    0.000    0.140    0.580
##    .V9                0.172    0.006   26.591    0.000    0.172    0.694
##    .V6                0.228    0.007   34.520    0.000    0.228    0.926
##    .V12               0.143    0.004   33.670    0.000    0.143    0.916
##    .V18               0.124    0.004   33.753    0.000    0.124    0.917
##    .V24               0.166    0.005   31.021    0.000    0.166    0.885
##     extraversion      0.003    0.001    2.480    0.013    1.000    1.000
##     neuroticism       0.052    0.005   10.010    0.000    1.000    1.000
##     lying             0.018    0.004    4.500    0.000    1.000    1.000
#Specify a three-factor model where lying is predicted by neuroticism
epi.model <- 'extraversion =~ V1 + V3 + V5 + V8
neuroticism =~ V2 + V4 + V7 + V9
lying =~ V6 + V12 + V18 + V24
lying ~ neuroticism'


#Run the model
epi.fit <- cfa(model = epi.model, data = epi)

#Examine the output 
summary(epi.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 120 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         26
## 
##                                                   Used       Total
##   Number of observations                          3193        3570
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     534.426
##   Degrees of freedom                                52
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             2196.019
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.774
##   Tucker-Lewis Index (TLI)                       0.713
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23183.000
##   Loglikelihood unrestricted model (H1)     -22915.787
## 
##   Number of free parameters                         26
##   Akaike (AIC)                               46417.999
##   Bayesian (BIC)                             46575.786
##   Sample-size adjusted Bayesian (BIC)        46493.173
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.054
##   90 Percent Confidence Interval          0.050  0.058
##   P-value RMSEA <= 0.05                          0.058
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.053
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion =~                                                       
##     V1                1.000                               0.052    0.115
##     V3                1.135    0.268    4.230    0.000    0.059    0.118
##     V5               -2.497    0.443   -5.638    0.000   -0.129   -0.346
##     V8                8.223    2.008    4.096    0.000    0.425    0.898
##   neuroticism =~                                                        
##     V2                1.000                               0.223    0.447
##     V4                0.462    0.054    8.493    0.000    0.103    0.209
##     V7                1.435    0.093   15.368    0.000    0.320    0.652
##     V9                1.214    0.078   15.570    0.000    0.271    0.545
##   lying =~                                                              
##     V6                1.000                               0.125    0.252
##     V12              -0.943    0.150   -6.274    0.000   -0.118   -0.298
##     V18              -0.905    0.143   -6.339    0.000   -0.113   -0.308
##     V24               1.187    0.182    6.509    0.000    0.148    0.342
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   lying ~                                                               
##     neuroticism      -0.298    0.043   -6.943    0.000   -0.532   -0.532
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion ~~                                                       
##     neuroticism       0.003    0.001    3.761    0.000    0.240    0.240
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .V1                0.198    0.005   39.671    0.000    0.198    0.987
##    .V3                0.244    0.006   39.651    0.000    0.244    0.986
##    .V5                0.123    0.004   28.256    0.000    0.123    0.881
##    .V8                0.043    0.033    1.302    0.193    0.043    0.193
##    .V2                0.200    0.006   33.262    0.000    0.200    0.800
##    .V4                0.233    0.006   38.804    0.000    0.233    0.956
##    .V7                0.139    0.007   20.087    0.000    0.139    0.575
##    .V9                0.174    0.006   27.907    0.000    0.174    0.703
##    .V6                0.231    0.007   35.398    0.000    0.231    0.936
##    .V12               0.143    0.004   33.349    0.000    0.143    0.911
##    .V18               0.122    0.004   32.825    0.000    0.122    0.905
##    .V24               0.166    0.005   30.854    0.000    0.166    0.883
##     extraversion      0.003    0.001    2.643    0.008    1.000    1.000
##     neuroticism       0.050    0.005    9.947    0.000    1.000    1.000
##    .lying             0.011    0.003    3.970    0.000    0.717    0.717
#Calculate the variance of V1
var(epi$V1, na.rm=TRUE)
## [1] 0.2018335
#Examine the modification indices
modificationindices(epi.fit, sort=TRUE)
##              lhs op   rhs      mi    epc sepc.lv sepc.all sepc.nox
## 39   neuroticism =~    V3 169.990 -0.704  -0.157   -0.316   -0.316
## 38   neuroticism =~    V1 125.582  0.545   0.122    0.271    0.271
## 47         lying =~    V3 119.773  1.294   0.162    0.325    0.325
## 69            V3 ~~    V7  77.456 -0.034  -0.034   -0.186   -0.186
## 57            V1 ~~    V2  76.872  0.033   0.033    0.164    0.164
## 46         lying =~    V1  64.865 -0.857  -0.107   -0.239   -0.239
## 58            V1 ~~    V4  27.288  0.020   0.020    0.093    0.093
## 34  extraversion =~    V6  24.201 -0.948  -0.049   -0.099   -0.099
## 121  neuroticism ~~ lying  23.184  0.020   0.854    0.854    0.854
## 120 extraversion ~~ lying  23.184 -0.001  -0.205   -0.205   -0.205
## 122  neuroticism  ~ lying  23.184  1.798   1.008    1.008    1.008
## 102           V4 ~~   V12  18.360  0.015   0.015    0.079    0.079
## 88            V8 ~~    V6  17.629 -0.016  -0.016   -0.161   -0.161
## 49         lying =~    V8  17.324 -0.641  -0.080   -0.170   -0.170
## 66            V3 ~~    V8  16.886  0.042   0.042    0.411    0.411
## 74            V3 ~~   V24  16.332  0.015   0.015    0.075    0.075
## 31  extraversion =~    V4  15.808  0.781   0.040    0.082    0.082
## 50         lying =~    V2  15.345  0.713   0.089    0.179    0.179
## 56            V1 ~~    V8  14.184 -0.035  -0.035   -0.373   -0.373
## 112           V9 ~~   V18  13.078  0.011   0.011    0.076    0.076
## 45   neuroticism =~   V24  10.674  0.319   0.071    0.164    0.164
## 65            V3 ~~    V5  10.297 -0.011  -0.011   -0.064   -0.064
## 67            V3 ~~    V2   9.800 -0.013  -0.013   -0.058   -0.058
## 83            V5 ~~   V24   9.658  0.008   0.008    0.058    0.058
## 85            V8 ~~    V4   9.532  0.012   0.012    0.118    0.118
## 98            V2 ~~   V24   8.780  0.011   0.011    0.060    0.060
## 115           V6 ~~   V18   8.093  0.010   0.010    0.062    0.062
## 63            V1 ~~   V18   7.944  0.008   0.008    0.052    0.052
## 116           V6 ~~   V24   7.436  0.012   0.012    0.063    0.063
## 84            V8 ~~    V2   7.279 -0.010  -0.010   -0.112   -0.112
## 110           V9 ~~    V6   7.179  0.011   0.011    0.055    0.055
## 70            V3 ~~    V9   7.091 -0.011  -0.011   -0.052   -0.052
## 60            V1 ~~    V9   6.550  0.009   0.009    0.050    0.050
## 73            V3 ~~   V18   6.533 -0.008  -0.008   -0.047   -0.047
## 30  extraversion =~    V2   6.090 -0.487  -0.025   -0.050   -0.050
## 44   neuroticism =~   V18   5.972  0.186   0.042    0.113    0.113
## 51         lying =~    V4   5.278 -0.411  -0.051   -0.104   -0.104
## 33  extraversion =~    V9   4.903 -0.446  -0.023   -0.046   -0.046
## 87            V8 ~~    V9   4.841 -0.009  -0.009   -0.099   -0.099
## 59            V1 ~~    V7   4.588  0.007   0.007    0.045    0.045
## 68            V3 ~~    V4   4.293  0.009   0.009    0.037    0.037
## 106           V7 ~~    V6   4.219 -0.009  -0.009   -0.048   -0.048
## 93            V2 ~~    V7   4.179  0.014   0.014    0.082    0.082
## 75            V5 ~~    V8   3.508  0.056   0.056    0.775    0.775
## 99            V4 ~~    V7   3.431 -0.009  -0.009   -0.048   -0.048
## 95            V2 ~~    V6   3.106  0.007   0.007    0.034    0.034
## 55            V1 ~~    V5   2.977  0.005   0.005    0.034    0.034
## 94            V2 ~~    V9   2.860  0.009   0.009    0.051    0.051
## 35  extraversion =~   V12   2.777  0.256   0.013    0.033    0.033
## 37  extraversion =~   V24   2.668 -0.275  -0.014   -0.033   -0.033
## 48         lying =~    V5   2.453  0.155   0.019    0.052    0.052
## 109           V7 ~~   V24   2.021  0.005   0.005    0.035    0.035
## 89            V8 ~~   V12   1.910  0.004   0.004    0.054    0.054
## 91            V8 ~~   V24   1.733 -0.004  -0.004   -0.052   -0.052
## 86            V8 ~~    V7   1.640  0.005   0.005    0.068    0.068
## 80            V5 ~~    V6   1.631  0.004   0.004    0.023    0.023
## 97            V2 ~~   V18   1.600 -0.004  -0.004   -0.025   -0.025
## 82            V5 ~~   V18   1.301  0.003   0.003    0.021    0.021
## 54            V1 ~~    V3   1.227 -0.004  -0.004   -0.020   -0.020
## 100           V4 ~~    V9   1.198 -0.005  -0.005   -0.024   -0.024
## 64            V1 ~~   V24   1.153  0.004   0.004    0.020    0.020
## 53         lying =~    V9   0.995 -0.189  -0.024   -0.048   -0.048
## 41   neuroticism =~    V8   0.823  0.163   0.036    0.077    0.077
## 62            V1 ~~   V12   0.814  0.003   0.003    0.016    0.016
## 90            V8 ~~   V18   0.797  0.003   0.003    0.035    0.035
## 76            V5 ~~    V2   0.714  0.003   0.003    0.016    0.016
## 117          V12 ~~   V18   0.698 -0.003  -0.003   -0.019   -0.019
## 71            V3 ~~    V6   0.637 -0.003  -0.003   -0.014   -0.014
## 36  extraversion =~   V18   0.579  0.109   0.006    0.015    0.015
## 61            V1 ~~    V6   0.468  0.003   0.003    0.012    0.012
## 113           V9 ~~   V24   0.404 -0.002  -0.002   -0.014   -0.014
## 118          V12 ~~   V24   0.343 -0.002  -0.002   -0.014   -0.014
## 96            V2 ~~   V12   0.341 -0.002  -0.002   -0.012   -0.012
## 43   neuroticism =~   V12   0.337  0.047   0.010    0.026    0.026
## 77            V5 ~~    V4   0.334  0.002   0.002    0.010    0.010
## 107           V7 ~~   V12   0.324 -0.002  -0.002   -0.014   -0.014
## 119          V18 ~~   V24   0.296 -0.002  -0.002   -0.014   -0.014
## 52         lying =~    V7   0.285  0.111   0.014    0.028    0.028
## 111           V9 ~~   V12   0.279 -0.002  -0.002   -0.011   -0.011
## 114           V6 ~~   V12   0.253  0.002   0.002    0.011    0.011
## 105           V7 ~~    V9   0.242 -0.004  -0.004   -0.027   -0.027
## 101           V4 ~~    V6   0.210 -0.002  -0.002   -0.008   -0.008
## 104           V4 ~~   V24   0.161  0.001   0.001    0.008    0.008
## 108           V7 ~~   V18   0.159 -0.001  -0.001   -0.010   -0.010
## 78            V5 ~~    V7   0.144  0.001   0.001    0.009    0.009
## 79            V5 ~~    V9   0.137 -0.001  -0.001   -0.008   -0.008
## 92            V2 ~~    V4   0.137  0.002   0.002    0.007    0.007
## 42   neuroticism =~    V6   0.128 -0.033  -0.007   -0.015   -0.015
## 32  extraversion =~    V7   0.021  0.031   0.002    0.003    0.003
## 81            V5 ~~   V12   0.017  0.000   0.000   -0.002   -0.002
## 103           V4 ~~   V18   0.009  0.000   0.000   -0.002   -0.002
## 40   neuroticism =~    V5   0.001 -0.002   0.000   -0.001   -0.001
## 72            V3 ~~   V12   0.000  0.000   0.000    0.000    0.000
#Edit the model specification
epi.model1 <- 'extraversion =~ V1 + V3 + V5 + V8
neuroticism =~ V2 + V4 + V7 + V9
lying =~ V6 + V12 + V18 + V24
neuroticism =~ V3'

#Reanalyze the model
epi.fit1 <- cfa(model = epi.model1, data = epi)

#Summarize the updated model
summary(epi.fit1, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 126 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         28
## 
##                                                   Used       Total
##   Number of observations                          3193        3570
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     332.891
##   Degrees of freedom                                50
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             2196.019
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.867
##   Tucker-Lewis Index (TLI)                       0.825
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23082.232
##   Loglikelihood unrestricted model (H1)     -22915.787
## 
##   Number of free parameters                         28
##   Akaike (AIC)                               46220.465
##   Bayesian (BIC)                             46390.389
##   Sample-size adjusted Bayesian (BIC)        46301.421
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.042
##   90 Percent Confidence Interval          0.038  0.046
##   P-value RMSEA <= 0.05                          0.999
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.040
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion =~                                                       
##     V1                1.000                               0.068    0.152
##     V3                1.798    0.325    5.532    0.000    0.123    0.246
##     V5               -2.268    0.360   -6.291    0.000   -0.155   -0.414
##     V8                5.077    0.887    5.725    0.000    0.346    0.732
##   neuroticism =~                                                        
##     V2                1.000                               0.222    0.445
##     V4                0.432    0.053    8.134    0.000    0.096    0.194
##     V7                1.493    0.093   16.025    0.000    0.331    0.675
##     V9                1.186    0.074   15.938    0.000    0.263    0.530
##   lying =~                                                              
##     V6                1.000                               0.135    0.272
##     V12              -0.851    0.127   -6.699    0.000   -0.115   -0.290
##     V18              -0.799    0.119   -6.728    0.000   -0.108   -0.294
##     V24               1.115    0.157    7.087    0.000    0.151    0.347
##   neuroticism =~                                                        
##     V3               -0.732    0.066  -11.074    0.000   -0.163   -0.327
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   extraversion ~~                                                       
##     neuroticism       0.004    0.001    4.953    0.000    0.283    0.283
##     lying            -0.003    0.001   -4.380    0.000   -0.346   -0.346
##   neuroticism ~~                                                        
##     lying            -0.016    0.002   -7.337    0.000   -0.521   -0.521
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .V1                0.196    0.005   39.250    0.000    0.196    0.977
##    .V3                0.217    0.006   34.642    0.000    0.217    0.878
##    .V5                0.116    0.004   29.066    0.000    0.116    0.828
##    .V8                0.104    0.014    7.603    0.000    0.104    0.465
##    .V2                0.200    0.006   33.875    0.000    0.200    0.802
##    .V4                0.235    0.006   39.046    0.000    0.235    0.962
##    .V7                0.131    0.007   19.577    0.000    0.131    0.544
##    .V9                0.178    0.006   29.830    0.000    0.178    0.720
##    .V6                0.228    0.007   34.969    0.000    0.228    0.926
##    .V12               0.144    0.004   34.186    0.000    0.144    0.916
##    .V18               0.123    0.004   34.035    0.000    0.123    0.914
##    .V24               0.166    0.005   31.188    0.000    0.166    0.879
##     extraversion      0.005    0.001    3.265    0.001    1.000    1.000
##     neuroticism       0.049    0.005   10.127    0.000    1.000    1.000
##     lying             0.018    0.004    4.651    0.000    1.000    1.000
#Analyze the original model
epi.fit <- cfa(model = epi.model, data = epi)

#Analyze the updated model
epi.fit1 <- cfa(model = epi.model1, data = epi)

#Compare those models
anova(epi.fit, epi.fit1)
## Chi Square Difference Test
## 
##          Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## epi.fit1 50 46220 46390 332.89                                  
## epi.fit  52 46418 46576 534.43     201.53       2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analyze the original model
epi.fit <- cfa(model = epi.model, data = epi)

#Find the fit indices for the original model
fitmeasures(epi.fit)[c("aic", "ecvi")]
##          aic         ecvi 
## 4.641800e+04 1.836599e-01
#Analyze the updated model
epi.fit1 <- cfa(model = epi.model1, data = epi)

#Find the fit indices for the updated model
fitmeasures(epi.fit1)[c("aic", "ecvi")]
##          aic         ecvi 
## 46220.464546     0.121795

Chapter 3 - Troubleshooting Model Errors and Diagrams

Heywood Cases on the Latent Variable:

  • Heywood cases (described by Heywood in 1931) are estimates that are out of bounds: correlations greater than 1 or negative variances
  • The lavaan package will throw a warning that the covariance matrix of latent variables is “not positive definite”
    • Usually occurs because one of the latent variables is really a combination of the others
    • Can then identify the highly correlated latent variables and collapse them into a single factor (or otherwise reduce the number of factors)

Heywood Cases on the Manifest Variable (negative error variances):

  • Generally occur due to a mis-specified (under-specified) model, small sample sizes, manifest variables on vastly different scales, etc.
  • The lavaan package will throw a warning that “some estimated ov variances are negative” (or that the model has not converged)
    • summary(negative.fit, standardized = TRUE, fit.measures = TRUE, rsquare = TRUE) # rsquare = TRUE can help to identify the issue; the R-square of each manifest variable should fall in (0, 1)
  • One fix is to freeze (constrain) the error variance of the problem variable to its variance in the raw data
    • negative.model <- 'latent1 =~ V1 + V2 + V3
      latent2 =~ V4 + V5 + V6
      V2 ~~ 18.83833*V2' # 18.84 is var(V2)

Create Diagrams with semPaths():

  • The semPlot library allows for diagramming the fitted models
    • library(semPlot)
    • twofactor.model <- 'text =~ x4 + x5 + x6
      speed =~ x7 + x8 + x9'
    • twofactor.fit <- cfa(model = twofactor.model, data = HolzingerSwineford1939)
    • semPaths(object = twofactor.fit)
  • The double-headed arrows looping on the manifest variables are (error) variances, and the double-headed arrows connecting the latent variables are covariances
  • There are many options for semPaths(); only a few will be covered here
    • semPaths(object = twofactor.fit, whatLabels = "std", edge.label.cex = 1) # "std" is standardized while "par" is parameters; edge.label.cex is the font size for the edges
    • semPaths(object = twofactor.fit, whatLabels = "std", edge.label.cex = 1, layout = "circle") # "tree" is the default layout
    • semPaths(object = twofactor.fit, whatLabels = "std", edge.label.cex = 1, layout = "tree", rotation = 2) # rotation applies only to tree layouts; 2 means left-to-right
    • semPaths(object = twofactor.fit, whatLabels = "std", edge.label.cex = 1, layout = "tree", rotation = 2, what = "std", edge.color = "purple") # what colors the edges by estimate strength
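
The calls above, consolidated into a runnable sketch using lavaan’s built-in HolzingerSwineford1939 data (the factor names text and speed are illustrative labels for the x4–x6 and x7–x9 indicators):

```r
library(lavaan)
library(semPlot)

# Two-factor CFA on the built-in Holzinger & Swineford (1939) data
twofactor.model <- 'text =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'
twofactor.fit <- cfa(model = twofactor.model, data = HolzingerSwineford1939)

# Default tree layout with plain path labels
semPaths(object = twofactor.fit)

# Standardized labels, larger edge fonts, left-to-right tree,
# with edges colored/weighted by standardized estimate strength
semPaths(object = twofactor.fit, whatLabels = "std", edge.label.cex = 1,
         layout = "tree", rotation = 2, what = "std", edge.color = "purple")
```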

Example code includes:

badlatentdata <- readr::read_csv("./RInputFiles/badlatentdata.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   V1 = col_double(),
##   V2 = col_double(),
##   V3 = col_double(),
##   V4 = col_double(),
##   V5 = col_double(),
##   V6 = col_double()
## )
badvardata <- readr::read_csv("./RInputFiles/badvardata.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   V1 = col_double(),
##   V2 = col_double(),
##   V3 = col_double(),
##   V4 = col_double(),
##   V5 = col_double(),
##   V6 = col_double()
## )
adoptsurvey <- badlatentdata %>%
    select(-X1) %>%
    rename(pictures=V1, background=V2, loveskids=V3, energy=V4, wagstail=V5, playful=V6)

#Look at the data
str(adoptsurvey, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 100 obs. of  6 variables:
##  $ pictures  : num  3.66318 -0.00508 2.99697 -0.90249 4.54211 ...
##  $ background: num  3.07 7.7 1.51 3.03 7.22 ...
##  $ loveskids : num  10.31 3.06 6.61 1.54 3.38 ...
##  $ energy    : num  3.68 2.42 3.51 -3.04 12.93 ...
##  $ wagstail  : num  5.26 7.05 4.25 2.17 6.23 ...
##  $ playful   : num  8.275 11.727 0.675 2.457 13.43 ...
head(adoptsurvey)
## # A tibble: 6 x 6
##   pictures background loveskids energy wagstail playful
##      <dbl>      <dbl>     <dbl>  <dbl>    <dbl>   <dbl>
## 1  3.66          3.07     10.3    3.68     5.26   8.28 
## 2 -0.00508       7.70      3.06   2.42     7.05  11.7  
## 3  3.00          1.51      6.61   3.51     4.25   0.675
## 4 -0.902         3.03      1.54  -3.04     2.17   2.46 
## 5  4.54          7.22      3.38  12.9      6.23  13.4  
## 6  0.0257       -4.35     -1.95  -6.07     3.13   5.60
#Build the model
adopt.model <- 'goodstory =~ pictures + background + loveskids
inperson =~ energy + wagstail + playful'

#Analyze the model
adopt.fit <- cfa(model = adopt.model, data = adoptsurvey)
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
##                 is not positive definite;
##                 use lavInspect(fit, "cov.lv") to investigate.
lavInspect(adopt.fit, "cov.lv")
##           gdstry inprsn
## goodstory 0.397        
## inperson  4.780  4.505
summary(adopt.fit, standardized=TRUE, fit.measures=TRUE)
## lavaan 0.6-3 ended normally after 61 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         13
## 
##   Number of observations                           100
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      15.674
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.047
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic               74.694
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.871
##   Tucker-Lewis Index (TLI)                       0.759
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1666.548
##   Loglikelihood unrestricted model (H1)      -1658.711
## 
##   Number of free parameters                         13
##   Akaike (AIC)                                3359.096
##   Bayesian (BIC)                              3392.963
##   Sample-size adjusted Bayesian (BIC)         3351.906
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.098
##   90 Percent Confidence Interval          0.010  0.170
##   P-value RMSEA <= 0.05                          0.126
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.080
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory =~                                                          
##     pictures          1.000                               0.630    0.200
##     background        1.089    0.347    3.135    0.002    0.686    0.168
##     loveskids         0.041    0.259    0.158    0.874    0.026    0.006
##   inperson =~                                                           
##     energy            1.000                               2.122    0.538
##     wagstail          1.134    0.300    3.780    0.000    2.406    0.473
##     playful           0.601    0.213    2.823    0.005    1.275    0.329
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory ~~                                                          
##     inperson          4.780    1.248    3.830    0.000    3.575    3.575
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .pictures          9.570    1.749    5.471    0.000    9.570    0.960
##    .background       16.198    2.641    6.133    0.000   16.198    0.972
##    .loveskids        21.675    3.065    7.071    0.000   21.675    1.000
##    .energy           11.031    1.920    5.745    0.000   11.031    0.710
##    .wagstail         20.085    3.199    6.278    0.000   20.085    0.776
##    .playful          13.382    1.955    6.845    0.000   13.382    0.892
##     goodstory         0.397    1.176    0.338    0.736    1.000    1.000
##     inperson          4.505    1.910    2.359    0.018    1.000    1.000
#Edit the original model 
adopt.model <- 'goodstory =~ pictures + background + loveskids + energy + wagstail + playful'

#Analyze the model
adopt.fit <- cfa(model = adopt.model, data = adoptsurvey)

#Look for Heywood cases
summary(adopt.fit, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 49 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           100
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                      27.071
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.001
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic               74.694
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.697
##   Tucker-Lewis Index (TLI)                       0.495
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1672.246
##   Loglikelihood unrestricted model (H1)      -1658.711
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                3368.493
##   Bayesian (BIC)                              3399.755
##   Sample-size adjusted Bayesian (BIC)         3361.856
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.142
##   90 Percent Confidence Interval          0.082  0.205
##   P-value RMSEA <= 0.05                          0.009
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.086
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory =~                                                          
##     pictures          1.000                               1.773    0.562
##     background        0.892    0.337    2.650    0.008    1.581    0.387
##     loveskids         0.547    0.344    1.587    0.112    0.969    0.208
##     energy            1.194    0.372    3.214    0.001    2.118    0.537
##     wagstail          1.712    0.517    3.310    0.001    3.035    0.597
##     playful           0.773    0.312    2.480    0.013    1.371    0.354
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .pictures          6.824    1.323    5.160    0.000    6.824    0.685
##    .background       14.168    2.228    6.359    0.000   14.168    0.850
##    .loveskids        20.736    3.009    6.891    0.000   20.736    0.957
##    .energy           11.051    2.049    5.394    0.000   11.051    0.711
##    .wagstail         16.661    3.486    4.779    0.000   16.661    0.644
##    .playful          13.128    2.021    6.496    0.000   13.128    0.875
##     goodstory         3.143    1.369    2.296    0.022    1.000    1.000
adoptsurvey <- badvardata %>%
    select(-X1) %>%
    rename(pictures=V1, background=V2, loveskids=V3, energy=V4, wagstail=V5, playful=V6)
str(adoptsurvey, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 100 obs. of  6 variables:
##  $ pictures  : num  3.71 1.24 1.19 -1.26 4.58 ...
##  $ background: num  -0.964 6.38 -4.329 5.196 -0.145 ...
##  $ loveskids : num  3.86 5.95 8.23 2.46 9.53 ...
##  $ energy    : num  -6.73 1.61 4.09 7.6 -3.13 ...
##  $ wagstail  : num  -1.199 0.532 4.59 3.699 2.546 ...
##  $ playful   : num  4.1 1.93 4.04 4.56 3.43 ...
summary(adoptsurvey)
##     pictures         background         loveskids          energy      
##  Min.   :-4.5482   Min.   :-4.83064   Min.   :-7.162   Min.   :-6.729  
##  1st Qu.:-0.5815   1st Qu.:-0.03834   1st Qu.: 1.751   1st Qu.: 0.386  
##  Median : 1.7705   Median : 3.33339   Median : 4.585   Median : 2.081  
##  Mean   : 1.7912   Mean   : 3.04096   Mean   : 5.178   Mean   : 2.362  
##  3rd Qu.: 3.6228   3rd Qu.: 5.57599   3rd Qu.: 8.709   3rd Qu.: 4.906  
##  Max.   : 9.4674   Max.   :14.81218   Max.   :18.237   Max.   :12.091  
##     wagstail         playful      
##  Min.   :-9.945   Min.   :-4.913  
##  1st Qu.:-1.226   1st Qu.: 1.811  
##  Median : 2.160   Median : 3.916  
##  Mean   : 2.346   Mean   : 3.711  
##  3rd Qu.: 5.242   3rd Qu.: 5.751  
##  Max.   :19.811   Max.   :11.446
#Build the model
adopt.model <- 'goodstory =~ pictures + background + loveskids
inperson =~ energy + wagstail + playful'

#Analyze the model
adopt.fit <- cfa(model=adopt.model, data=adoptsurvey)
## Warning in lav_object_post_check(object): lavaan WARNING: some estimated ov
## variances are negative
#Summarize the model to view the negative variances
summary(adopt.fit, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6-3 ended normally after 303 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         13
## 
##   Number of observations                           100
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       7.134
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.522
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic               25.380
##   Degrees of freedom                                15
##   P-value                                        0.045
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.156
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1649.956
##   Loglikelihood unrestricted model (H1)      -1646.389
## 
##   Number of free parameters                         13
##   Akaike (AIC)                                3325.912
##   Bayesian (BIC)                              3359.779
##   Sample-size adjusted Bayesian (BIC)         3318.722
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.109
##   P-value RMSEA <= 0.05                          0.686
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.050
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory =~                                                          
##     pictures          1.000                               1.360    0.437
##     background        1.471    0.763    1.928    0.054    2.000    0.521
##     loveskids         1.746    0.892    1.958    0.050    2.375    0.501
##   inperson =~                                                           
##     energy            1.000                               0.208    0.058
##     wagstail         45.262 1090.143    0.042    0.967    9.409    1.969
##     playful           0.869    1.110    0.783    0.434    0.181    0.054
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory ~~                                                          
##     inperson         -0.014    0.332   -0.041    0.967   -0.048   -0.048
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .pictures          7.814    1.514    5.162    0.000    7.814    0.809
##    .background       10.762    2.695    3.993    0.000   10.762    0.729
##    .loveskids        16.791    3.936    4.266    0.000   16.791    0.749
##    .energy           12.642    2.066    6.119    0.000   12.642    0.997
##    .wagstail        -65.677 2124.215   -0.031    0.975  -65.677   -2.875
##    .playful          11.148    1.760    6.335    0.000   11.148    0.997
##     goodstory         1.850    1.310    1.411    0.158    1.000    1.000
##     inperson          0.043    1.046    0.041    0.967    1.000    1.000
## 
## R-Square:
##                    Estimate
##     pictures          0.191
##     background        0.271
##     loveskids         0.251
##     energy            0.003
##     wagstail             NA
##     playful           0.003
#View the variance of the problem manifest variable
var(adoptsurvey$wagstail)
## [1] 23.07446
#Update the model using 5 decimal places
adopt.model2 <- 'goodstory =~ pictures + background + loveskids
inperson =~ energy + wagstail + playful
wagstail~~23.07446*wagstail'

#Analyze and summarize the updated model
adopt.fit2 <- cfa(model = adopt.model2, data = adoptsurvey)
summary(adopt.fit2, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6-3 ended normally after 69 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           100
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       8.493
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.485
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic               25.380
##   Degrees of freedom                                15
##   P-value                                        0.045
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.081
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1650.635
##   Loglikelihood unrestricted model (H1)      -1646.389
## 
##   Number of free parameters                         12
##   Akaike (AIC)                                3325.270
##   Bayesian (BIC)                              3356.532
##   Sample-size adjusted Bayesian (BIC)         3318.633
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.108
##   P-value RMSEA <= 0.05                          0.664
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.058
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory =~                                                          
##     pictures          1.000                               1.344    0.432
##     background        1.461    0.758    1.928    0.054    1.964    0.511
##     loveskids         1.818    0.947    1.919    0.055    2.444    0.516
##   inperson =~                                                           
##     energy            1.000                               0.959    0.269
##     wagstail          1.391    2.244    0.620    0.535    1.334    0.268
##     playful           0.807    1.640    0.492    0.623    0.774    0.231
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   goodstory ~~                                                          
##     inperson         -0.077    0.450   -0.172    0.863   -0.060   -0.060
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .wagstail         23.074                              23.074    0.928
##    .pictures          7.857    1.510    5.203    0.000    7.857    0.813
##    .background       10.906    2.672    4.082    0.000   10.906    0.739
##    .loveskids        16.461    4.103    4.012    0.000   16.461    0.734
##    .energy           11.765    2.683    4.385    0.000   11.765    0.928
##    .playful          10.582    2.082    5.084    0.000   10.582    0.946
##     goodstory         1.807    1.296    1.395    0.163    1.000    1.000
##     inperson          0.920    2.209    0.416    0.677    1.000    1.000
## 
## R-Square:
##                    Estimate
##     wagstail          0.072
##     pictures          0.187
##     background        0.261
##     loveskids         0.266
##     energy            0.072
##     playful           0.054
#Create a default picture
semPlot::semPaths(adopt.fit)

#Update the default picture
semPlot::semPaths(object = adopt.fit, layout="tree", rotation=2)

#Further customize the picture with standardized estimates and edge labels
semPlot::semPaths(object = adopt.fit, layout = "tree", rotation = 2, whatLabels = "std", 
                  edge.label.cex = 1, what = "std", edge.color = "blue"
                  )


Chapter 4 - Full Example and Extension

Model WAIS-III IQ Scale:

  • WAIS-III is a four-factor model of intelligence, including verbal, working memory, perceptual organization, and processing speed
    • The idea is that Verbal IQ drives the verbal and working memory factors, Performance IQ drives the perceptual and processing factors, and Verbal and Performance IQ drive each other
    • 4 latent variables, measured by 12 manifest variables, with 2 additional latent variables at a higher layer that drive the initial 4 latent variables
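The two-layer structure described above is easiest to see in lavaan model syntax. A hypothetical sketch (the higher-order factor names verbalIQ and performanceIQ are assumed here; the manifest and first-order factor names follow the IQdata example later in this chapter):

```r
#Hypothetical two-layer WAIS-III structure (not fit in these notes):
#four first-order factors measured by 12 manifest variables, plus two
#higher-order latent variables that drive them and covary with each other
wais.twolayer <- 'verbalcomp =~ vocab + simil + inform + compreh
workingmemory =~ arith + digspan + lnseq
perceptorg =~ piccomp + block + matrixreason
processing =~ digsym + symbolsearch
verbalIQ =~ verbalcomp + workingmemory
performanceIQ =~ perceptorg + processing
verbalIQ ~~ performanceIQ'
```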

Update WAIS-III Model:

  • Once the model is stable, one can look for additional areas to improve it further
  • Variables with weak loadings but high variance should be explored further
  • Modification indices can also be used to better understand and refine the model
    • modificationindices(wais.fit, sort = TRUE)

Hierarchical Model of IQ:

  • One overall IQ factor that serves as the latent variable driving all of the other latent variables
    • wais.model3 <- ’verbalcomp =~ vocab + simil + inform + compreh
    • workingmemory =~ arith + digspan + lnseq
    • perceptorg =~ piccomp + block + matrixreason + digsym + symbolsearch
    • simil ~~ inform
    • general =~ verbalcomp + workingmemory + perceptorg’ # general is a new second-order latent variable, built from the other latent variables
  • The updated model will often have the same fit indices (the parameters simply shift from covariances to loadings)

Wrap Up:

  • Learned model syntax for lavaan (=~ for latent, ~~ for covariance/correlation, and ~ for prediction)
  • Learned to add constraints and troubleshoot Heywood cases
  • Learned one-factor, multi-factor, and hierarchical models

Example code includes:

IQdata <- readr::read_csv("./RInputFiles/IQdata.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   inform = col_double(),
##   simil = col_double(),
##   vocab = col_double(),
##   compreh = col_double(),
##   digspan = col_double(),
##   arith = col_double(),
##   piccomp = col_double(),
##   block = col_double(),
##   matrixreason = col_double(),
##   symbolsearch = col_double(),
##   digsym = col_double(),
##   lnseq = col_double()
## )
glimpse(IQdata)
## Observations: 300
## Variables: 13
## $ X1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ inform       <dbl> 31, 15, 13, 13, 22, 25, 20, 18, 21, 22, 16, 23, 1...
## $ simil        <dbl> 23, 20, 22, 21, 21, 22, 25, 25, 22, 25, 25, 24, 1...
## $ vocab        <dbl> 63, 44, 40, 51, 55, 61, 45, 61, 57, 56, 62, 53, 4...
## $ compreh      <dbl> 27, 21, 28, 21, 28, 27, 23, 28, 27, 22, 28, 30, 2...
## $ digspan      <dbl> 20, 13, 14, 22, 17, 20, 13, 22, 14, 15, 15, 26, 1...
## $ arith        <dbl> 18, 12, 13, 13, 10, 20, 16, 14, 16, 10, 13, 21, 1...
## $ piccomp      <dbl> 18, 13, 13, 16, 13, 18, 16, 22, 16, 16, 20, 19, 1...
## $ block        <dbl> 50, 29, 28, 36, 22, 59, 33, 43, 40, 31, 35, 59, 4...
## $ matrixreason <dbl> 21, 17, 16, 14, 13, 18, 14, 18, 13, 13, 21, 16, 1...
## $ symbolsearch <dbl> 38, 24, 25, 27, 27, 38, 31, 42, 34, 29, 37, 40, 3...
## $ digsym       <dbl> 57, 56, 72, 67, 60, 78, 60, 45, 40, 57, 63, 87, 7...
## $ lnseq        <dbl> 15, 12, 13, 18, 15, 16, 12, 30, 19, 16, 23, 16, 1...
IQdata <- IQdata %>%
    select(-X1)
glimpse(IQdata)
## Observations: 300
## Variables: 12
## $ inform       <dbl> 31, 15, 13, 13, 22, 25, 20, 18, 21, 22, 16, 23, 1...
## $ simil        <dbl> 23, 20, 22, 21, 21, 22, 25, 25, 22, 25, 25, 24, 1...
## $ vocab        <dbl> 63, 44, 40, 51, 55, 61, 45, 61, 57, 56, 62, 53, 4...
## $ compreh      <dbl> 27, 21, 28, 21, 28, 27, 23, 28, 27, 22, 28, 30, 2...
## $ digspan      <dbl> 20, 13, 14, 22, 17, 20, 13, 22, 14, 15, 15, 26, 1...
## $ arith        <dbl> 18, 12, 13, 13, 10, 20, 16, 14, 16, 10, 13, 21, 1...
## $ piccomp      <dbl> 18, 13, 13, 16, 13, 18, 16, 22, 16, 16, 20, 19, 1...
## $ block        <dbl> 50, 29, 28, 36, 22, 59, 33, 43, 40, 31, 35, 59, 4...
## $ matrixreason <dbl> 21, 17, 16, 14, 13, 18, 14, 18, 13, 13, 21, 16, 1...
## $ symbolsearch <dbl> 38, 24, 25, 27, 27, 38, 31, 42, 34, 29, 37, 40, 3...
## $ digsym       <dbl> 57, 56, 72, 67, 60, 78, 60, 45, 40, 57, 63, 87, 7...
## $ lnseq        <dbl> 15, 12, 13, 18, 15, 16, 12, 30, 19, 16, 23, 16, 1...
#Build a four-factor model
wais.model <- 'verbalcomp =~ vocab + simil + inform + compreh 
workingmemory =~ arith + digspan + lnseq
perceptorg =~ piccomp + block + matrixreason
processing =~ digsym + symbolsearch'

#Analyze the model
wais.fit <- cfa(model=wais.model, data=IQdata)
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
##                 is not positive definite;
##                 use lavInspect(fit, "cov.lv") to investigate.
#Summarize the model
summary(wais.fit, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6-3 ended normally after 153 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         30
## 
##   Number of observations                           300
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     233.268
##   Degrees of freedom                                48
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             1042.916
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.810
##   Tucker-Lewis Index (TLI)                       0.739
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -9939.800
##   Loglikelihood unrestricted model (H1)      -9823.166
## 
##   Number of free parameters                         30
##   Akaike (AIC)                               19939.599
##   Bayesian (BIC)                             20050.713
##   Sample-size adjusted Bayesian (BIC)        19955.570
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.113
##   90 Percent Confidence Interval          0.099  0.128
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.073
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbalcomp =~                                                         
##     vocab             1.000                               6.282    0.879
##     simil             0.296    0.031    9.470    0.000    1.859    0.581
##     inform            0.450    0.043   10.483    0.000    2.825    0.645
##     compreh           0.315    0.035    8.986    0.000    1.979    0.551
##   workingmemory =~                                                      
##     arith             1.000                               2.530    0.845
##     digspan           0.875    0.137    6.373    0.000    2.213    0.561
##     lnseq             0.225    0.106    2.130    0.033    0.570    0.142
##   perceptorg =~                                                         
##     piccomp           1.000                               1.391    0.596
##     block             3.988    0.421    9.477    0.000    5.546    0.719
##     matrixreason      0.909    0.127    7.171    0.000    1.264    0.494
##   processing =~                                                         
##     digsym            1.000                               2.809    0.239
##     symbolsearch      1.065    0.300    3.547    0.000    2.990    0.724
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbalcomp ~~                                                         
##     workingmemory     6.120    1.232    4.969    0.000    0.385    0.385
##     perceptorg        5.644    0.868    6.503    0.000    0.646    0.646
##     processing       10.050    3.150    3.190    0.001    0.570    0.570
##   workingmemory ~~                                                      
##     perceptorg        2.437    0.371    6.561    0.000    0.693    0.693
##     processing        2.701    0.984    2.745    0.006    0.380    0.380
##   perceptorg ~~                                                         
##     processing        4.027    1.200    3.356    0.001    1.031    1.031
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .vocab            11.573    2.656    4.357    0.000   11.573    0.227
##    .simil             6.792    0.620   10.951    0.000    6.792    0.663
##    .inform           11.201    1.084   10.330    0.000   11.201    0.584
##    .compreh           8.969    0.804   11.157    0.000    8.969    0.696
##    .arith             2.560    0.901    2.842    0.004    2.560    0.286
##    .digspan          10.653    1.102    9.666    0.000   10.653    0.685
##    .lnseq            15.750    1.294   12.173    0.000   15.750    0.980
##    .piccomp           3.505    0.323   10.851    0.000    3.505    0.644
##    .block            28.761    3.207    8.968    0.000   28.761    0.483
##    .matrixreason      4.957    0.431   11.509    0.000    4.957    0.756
##    .digsym          130.314   10.847   12.014    0.000  130.314    0.943
##    .symbolsearch      8.127    2.480    3.277    0.001    8.127    0.476
##     verbalcomp       39.459    4.757    8.294    0.000    1.000    1.000
##     workingmemory     6.399    1.122    5.703    0.000    1.000    1.000
##     perceptorg        1.934    0.371    5.211    0.000    1.000    1.000
##     processing        7.889    4.309    1.831    0.067    1.000    1.000
## 
## R-Square:
##                    Estimate
##     vocab             0.773
##     simil             0.337
##     inform            0.416
##     compreh           0.304
##     arith             0.714
##     digspan           0.315
##     lnseq             0.020
##     piccomp           0.356
##     block             0.517
##     matrixreason      0.244
##     digsym            0.057
##     symbolsearch      0.524
#Edit the original model
wais.model <- 'verbalcomp =~ vocab + simil + inform + compreh 
workingmemory =~ arith + digspan + lnseq
perceptorg =~ piccomp + block + matrixreason + digsym + symbolsearch'

#Analyze the model
wais.fit <- cfa(model=wais.model, data=IQdata)

#Summarize the model
summary(wais.fit, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6-3 ended normally after 110 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         27
## 
##   Number of observations                           300
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     252.809
##   Degrees of freedom                                51
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             1042.916
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.793
##   Tucker-Lewis Index (TLI)                       0.733
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -9949.570
##   Loglikelihood unrestricted model (H1)      -9823.166
## 
##   Number of free parameters                         27
##   Akaike (AIC)                               19953.141
##   Bayesian (BIC)                             20053.143
##   Sample-size adjusted Bayesian (BIC)        19967.515
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.115
##   90 Percent Confidence Interval          0.101  0.129
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.076
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbalcomp =~                                                         
##     vocab             1.000                               6.281    0.879
##     simil             0.296    0.031    9.483    0.000    1.861    0.581
##     inform            0.449    0.043   10.481    0.000    2.822    0.644
##     compreh           0.315    0.035    8.999    0.000    1.981    0.552
##   workingmemory =~                                                      
##     arith             1.000                               2.528    0.844
##     digspan           0.881    0.152    5.786    0.000    2.227    0.565
##     lnseq             0.205    0.107    1.920    0.055    0.518    0.129
##   perceptorg =~                                                         
##     piccomp           1.000                               1.517    0.650
##     block             3.739    0.390    9.583    0.000    5.672    0.735
##     matrixreason      0.832    0.117    7.099    0.000    1.262    0.493
##     digsym            1.603    0.507    3.160    0.002    2.431    0.207
##     symbolsearch      1.880    0.204    9.236    0.000    2.852    0.690
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbalcomp ~~                                                         
##     workingmemory     6.132    1.234    4.970    0.000    0.386    0.386
##     perceptorg        5.892    0.886    6.647    0.000    0.618    0.618
##   workingmemory ~~                                                      
##     perceptorg        2.227    0.362    6.149    0.000    0.581    0.581
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .vocab            11.577    2.651    4.367    0.000   11.577    0.227
##    .simil             6.787    0.620   10.950    0.000    6.787    0.662
##    .inform           11.218    1.085   10.342    0.000   11.218    0.585
##    .compreh           8.962    0.803   11.155    0.000    8.962    0.696
##    .arith             2.571    1.014    2.535    0.011    2.571    0.287
##    .digspan          10.590    1.161    9.121    0.000   10.590    0.681
##    .lnseq            15.807    1.297   12.183    0.000   15.807    0.983
##    .piccomp           3.138    0.317    9.913    0.000    3.138    0.577
##    .block            27.343    3.226    8.476    0.000   27.343    0.459
##    .matrixreason      4.960    0.441   11.243    0.000    4.960    0.757
##    .digsym          132.291   10.925   12.109    0.000  132.291    0.957
##    .symbolsearch      8.936    0.957    9.333    0.000    8.936    0.524
##     verbalcomp       39.455    4.754    8.299    0.000    1.000    1.000
##     workingmemory     6.388    1.215    5.259    0.000    1.000    1.000
##     perceptorg        2.301    0.408    5.646    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     vocab             0.773
##     simil             0.338
##     inform            0.415
##     compreh           0.304
##     arith             0.713
##     digspan           0.319
##     lnseq             0.017
##     piccomp           0.423
##     block             0.541
##     matrixreason      0.243
##     digsym            0.043
##     symbolsearch      0.476
#Update the default picture
semPlot::semPaths(object = wais.fit, layout = "tree", rotation = 1, whatLabels = "std", 
                  edge.label.cex = 1, what = "std", edge.color = "black"
                  )

#Examine modification indices 
modificationindices(wais.fit, sort = TRUE)
##               lhs op          rhs     mi     epc sepc.lv sepc.all sepc.nox
## 66          simil ~~       inform 35.879  -3.757  -3.757   -0.431   -0.431
## 56          vocab ~~       inform 28.377   9.783   9.783    0.858    0.858
## 48     perceptorg =~        vocab 21.865  -2.077  -3.151   -0.441   -0.441
## 115         block ~~ matrixreason 16.209  -3.622  -3.622   -0.311   -0.311
## 96          arith ~~        block 15.061   3.679   3.679    0.439    0.439
## 117         block ~~ symbolsearch 13.144   5.725   5.725    0.366    0.366
## 47  workingmemory =~ symbolsearch 12.272  -0.467  -1.181   -0.286   -0.286
## 81         inform ~~        block 12.269   4.358   4.358    0.249    0.249
## 64          vocab ~~       digsym 11.578 -11.261 -11.261   -0.288   -0.288
## 40  workingmemory =~        simil 11.383   0.278   0.703    0.220    0.220
## 72          simil ~~        block 10.605  -3.084  -3.084   -0.226   -0.226
## 45  workingmemory =~ matrixreason  9.685   0.267   0.675    0.264    0.264
## 95          arith ~~      piccomp  9.463  -0.892  -0.892   -0.314   -0.314
## 60          vocab ~~        lnseq  9.425  -3.486  -3.486   -0.258   -0.258
## 67          simil ~~      compreh  9.356   1.587   1.587    0.203    0.203
## 44  workingmemory =~        block  9.258   0.765   1.933    0.251    0.251
## 51     perceptorg =~      compreh  9.177   0.601   0.912    0.254    0.254
## 62          vocab ~~        block  8.712  -5.377  -5.377   -0.302   -0.302
## 73          simil ~~ matrixreason  8.672   1.065   1.065    0.184    0.184
## 106         lnseq ~~      piccomp  8.620   1.298   1.298    0.184    0.184
## 91        compreh ~~       digsym  8.155   5.908   5.908    0.172    0.172
## 59          vocab ~~      digspan  8.127   2.849   2.849    0.257    0.257
## 37     verbalcomp =~       digsym  7.803  -0.464  -2.917   -0.248   -0.248
## 68          simil ~~        arith  7.534   1.064   1.064    0.255    0.255
## 99          arith ~~ symbolsearch  7.468  -1.391  -1.391   -0.290   -0.290
## 57          vocab ~~      compreh  7.107  -3.508  -3.508   -0.344   -0.344
## 87        compreh ~~        lnseq  7.001   1.887   1.887    0.159    0.159
## 97          arith ~~ matrixreason  6.391   0.848   0.848    0.237    0.237
## 107         lnseq ~~        block  5.677   3.289   3.289    0.158    0.158
## 34     verbalcomp =~      piccomp  5.507   0.071   0.447    0.192    0.192
## 78         inform ~~      digspan  5.435  -1.649  -1.649   -0.151   -0.151
## 33     verbalcomp =~        lnseq  5.250  -0.104  -0.652   -0.163   -0.163
## 54     perceptorg =~        lnseq  4.644   0.512   0.777    0.194    0.194
## 39  workingmemory =~        vocab  4.638  -0.406  -1.025   -0.143   -0.143
## 102       digspan ~~        block  4.564  -2.689  -2.689   -0.158   -0.158
## 35     verbalcomp =~        block  4.551  -0.218  -1.371   -0.178   -0.178
## 88        compreh ~~      piccomp  4.455   0.728   0.728    0.137    0.137
## 112       piccomp ~~ matrixreason  4.306   0.568   0.568    0.144    0.144
## 101       digspan ~~      piccomp  4.218   0.808   0.808    0.140    0.140
## 46  workingmemory =~       digsym  4.139  -0.852  -2.152   -0.183   -0.183
## 71          simil ~~      piccomp  4.029   0.607   0.607    0.132    0.132
## 76         inform ~~      compreh  3.789  -1.367  -1.367   -0.136   -0.136
## 70          simil ~~        lnseq  3.693  -1.200  -1.200   -0.116   -0.116
## 50     perceptorg =~       inform  3.487   0.444   0.673    0.154    0.154
## 58          vocab ~~        arith  3.451  -1.457  -1.457   -0.267   -0.267
## 55          vocab ~~        simil  3.393   2.239   2.239    0.253    0.253
## 113       piccomp ~~       digsym  3.375   2.419   2.419    0.119    0.119
## 93          arith ~~      digspan  3.274   7.960   7.960    1.526    1.526
## 86        compreh ~~      digspan  3.234  -1.110  -1.110   -0.114   -0.114
## 80         inform ~~      piccomp  2.871  -0.672  -0.672   -0.113   -0.113
## 104       digspan ~~       digsym  2.754  -3.822  -3.822   -0.102   -0.102
## 114       piccomp ~~ symbolsearch  2.677  -0.731  -0.731   -0.138   -0.138
## 89        compreh ~~        block  2.551   1.725   1.725    0.110    0.110
## 90        compreh ~~ matrixreason  2.342  -0.632  -0.632   -0.095   -0.095
## 74          simil ~~       digsym  2.021  -2.575  -2.575   -0.086   -0.086
## 43  workingmemory =~      piccomp  1.899  -0.104  -0.262   -0.113   -0.113
## 49     perceptorg =~        simil  1.675   0.227   0.345    0.108    0.108
## 92        compreh ~~ symbolsearch  1.646   0.764   0.764    0.085    0.085
## 111       piccomp ~~        block  1.591  -1.084  -1.084   -0.117   -0.117
## 85        compreh ~~        arith  1.350  -0.514  -0.514   -0.107   -0.107
## 32     verbalcomp =~      digspan  1.224   0.058   0.365    0.092    0.092
## 79         inform ~~        lnseq  0.998  -0.815  -0.815   -0.061   -0.061
## 69          simil ~~      digspan  0.996   0.540   0.540    0.064    0.064
## 53     perceptorg =~      digspan  0.942  -0.710  -1.077   -0.273   -0.273
## 77         inform ~~        arith  0.890   0.480   0.480    0.089    0.089
## 116         block ~~       digsym  0.805   3.770   3.770    0.063    0.063
## 120        digsym ~~ symbolsearch  0.724   1.948   1.948    0.057    0.057
## 100       digspan ~~        lnseq  0.703  -0.688  -0.688   -0.053   -0.053
## 83         inform ~~       digsym  0.667   1.935   1.935    0.050    0.050
## 36     verbalcomp =~ matrixreason  0.543   0.025   0.159    0.062    0.062
## 61          vocab ~~      piccomp  0.529   0.414   0.414    0.069    0.069
## 105       digspan ~~ symbolsearch  0.481  -0.475  -0.475   -0.049   -0.049
## 52     perceptorg =~        arith  0.478  -0.694  -1.052   -0.352   -0.352
## 98          arith ~~       digsym  0.474  -1.135  -1.135   -0.062   -0.062
## 94          arith ~~        lnseq  0.430  -0.496  -0.496   -0.078   -0.078
## 31     verbalcomp =~        arith  0.237  -0.029  -0.182   -0.061   -0.061
## 103       digspan ~~ matrixreason  0.226   0.221   0.221    0.030    0.030
## 42  workingmemory =~      compreh  0.190  -0.041  -0.103   -0.029   -0.029
## 75          simil ~~ symbolsearch  0.188  -0.227  -0.227   -0.029   -0.029
## 63          vocab ~~ matrixreason  0.143  -0.253  -0.253   -0.033   -0.033
## 109         lnseq ~~       digsym  0.128  -0.951  -0.951   -0.021   -0.021
## 38     verbalcomp =~ symbolsearch  0.077   0.015   0.094    0.023    0.023
## 118  matrixreason ~~       digsym  0.060  -0.380  -0.380   -0.015   -0.015
## 41  workingmemory =~       inform  0.037   0.021   0.053    0.012    0.012
## 119  matrixreason ~~ symbolsearch  0.031  -0.085  -0.085   -0.013   -0.013
## 108         lnseq ~~ matrixreason  0.017   0.069   0.069    0.008    0.008
## 110         lnseq ~~ symbolsearch  0.009   0.072   0.072    0.006    0.006
## 65          vocab ~~ symbolsearch  0.005  -0.068  -0.068   -0.007   -0.007
## 84         inform ~~ symbolsearch  0.004  -0.045  -0.045   -0.004   -0.004
## 82         inform ~~ matrixreason  0.004   0.029   0.029    0.004    0.004
#Update the three-factor model
wais.model2 <- 'verbalcomp =~ vocab + simil + inform + compreh 
workingmemory =~ arith + digspan + lnseq
perceptorg =~ piccomp + block + matrixreason + digsym + symbolsearch
simil ~~ inform'

#Analyze the three-factor model using IQdata
wais.fit2 <- cfa(model=wais.model2, data=IQdata)

#Summarize the three-factor model 
summary(wais.fit2, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)
## lavaan 0.6-3 ended normally after 114 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         28
## 
##   Number of observations                           300
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     212.813
##   Degrees of freedom                                50
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             1042.916
##   Degrees of freedom                                66
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.833
##   Tucker-Lewis Index (TLI)                       0.780
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -9929.572
##   Loglikelihood unrestricted model (H1)      -9823.166
## 
##   Number of free parameters                         28
##   Akaike (AIC)                               19915.144
##   Bayesian (BIC)                             20018.850
##   Sample-size adjusted Bayesian (BIC)        19930.051
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.104
##   90 Percent Confidence Interval          0.090  0.119
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.071
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   verbalcomp =~                                                         
##     vocab             1.000                               5.888    0.824
##     simil             0.361    0.035   10.184    0.000    2.125    0.664
##     inform            0.525    0.048   10.857    0.000    3.090    0.706
##     compreh           0.334    0.036    9.349    0.000    1.965    0.547
##   workingmemory =~                                                      
##     arith             1.000                               2.565    0.857
##     digspan           0.857    0.149    5.768    0.000    2.199    0.558
##     lnseq             0.193    0.104    1.850    0.064    0.495    0.123
##   perceptorg =~                                                         
##     piccomp           1.000                               1.515    0.650
##     block             3.737    0.390    9.581    0.000    5.662    0.734
##     matrixreason      0.843    0.118    7.176    0.000    1.278    0.499
##     digsym            1.615    0.508    3.181    0.001    2.446    0.208
##     symbolsearch      1.875    0.203    9.218    0.000    2.841    0.688
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .simil ~~                                                              
##    .inform           -3.738    0.606   -6.169    0.000   -3.738   -0.503
##   verbalcomp ~~                                                         
##     workingmemory     6.278    1.181    5.315    0.000    0.416    0.416
##     perceptorg        5.654    0.859    6.583    0.000    0.634    0.634
##   workingmemory ~~                                                      
##     perceptorg        2.237    0.363    6.172    0.000    0.576    0.576
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .vocab            16.365    2.375    6.892    0.000   16.365    0.321
##    .simil             5.734    0.610    9.399    0.000    5.734    0.560
##    .inform            9.635    1.095    8.801    0.000    9.635    0.502
##    .compreh           9.026    0.791   11.413    0.000    9.026    0.700
##    .arith             2.380    1.037    2.294    0.022    2.380    0.266
##    .digspan          10.715    1.154    9.282    0.000   10.715    0.689
##    .lnseq            15.830    1.298   12.193    0.000   15.830    0.985
##    .piccomp           3.143    0.316    9.937    0.000    3.143    0.578
##    .block            27.457    3.220    8.527    0.000   27.457    0.461
##    .matrixreason      4.921    0.439   11.216    0.000    4.921    0.751
##    .digsym          132.218   10.920   12.108    0.000  132.218    0.957
##    .symbolsearch      8.996    0.958    9.393    0.000    8.996    0.527
##     verbalcomp       34.667    4.408    7.865    0.000    1.000    1.000
##     workingmemory     6.579    1.239    5.309    0.000    1.000    1.000
##     perceptorg        2.296    0.407    5.643    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     vocab             0.679
##     simil             0.440
##     inform            0.498
##     compreh           0.300
##     arith             0.734
##     digspan           0.311
##     lnseq             0.015
##     piccomp           0.422
##     block             0.539
##     matrixreason      0.249
##     digsym            0.043
##     symbolsearch      0.473
#Compare the models
anova(wais.fit, wais.fit2)
## Chi Square Difference Test
## 
##           Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## wais.fit2 50 19915 20019 212.81                                  
## wais.fit  51 19953 20053 252.81     39.996       1  2.545e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#View the fit indices for the original model
fitmeasures(wais.fit, c("aic", "ecvi"))
##       aic      ecvi 
## 19953.141     1.023
#View the fit indices for the updated model
fitmeasures(wais.fit2, c("aic", "ecvi"))
##       aic      ecvi 
## 19915.144     0.896
#Update the three-factor model to a hierarchical model
wais.model3 <- 'verbalcomp =~ vocab + simil + inform + compreh 
workingmemory =~ arith + digspan + lnseq
perceptorg =~ piccomp + block + matrixreason + digsym + symbolsearch
simil ~~ inform
general =~ verbalcomp + workingmemory + perceptorg'

#Analyze the hierarchical model where data is IQdata
wais.fit3 <- cfa(model = wais.model3, data = IQdata)

#Examine the fit indices for the old model
fitmeasures(wais.fit2, c("rmsea", "srmr"))
## rmsea  srmr 
## 0.104 0.071
#Examine the fit indices for the new model
fitmeasures(wais.fit3, c("rmsea", "srmr"))
## rmsea  srmr 
## 0.104 0.071
#Update the default picture
semPlot::semPaths(object = wais.fit3, layout = "tree", rotation = 1, whatLabels = "std", 
                  edge.label.cex = 1, what = "std", edge.color = "navy"
                  )


Working with Data in the Tidyverse

Chapter 1 - Explore Data

Import data:

  • Beginning steps of the pipeline include importing, tidying, and transforming (wrangling)
  • Focus of this course will be rectangular data including both columns (variables) and rows (observations)
    • bakers # 10x6 tibble
    • tibbles are a special type of data frame - both store rectangular data in R
  • Can read the data using readr::read_csv()
    • ?read_csv
    • bakers <- read_csv(“bakers.csv”)
    • bakers # same 10x6 tibble

Know data:

  • The bakeoff data includes three types of challenges - Signature, Technical, Showstopper
  • Tibble printing by default will cut off columns that do not fit on screen and just list the remaining variables by name - glimpse() from dplyr can help with viewing every column
    • glimpse(bakers_mini)
    • library(skimr)
    • skim(bakers_mini) # skim provides statistics for every column depending on the variable types

Count data - the course video is broken, but it provides some useful code snippets:

  • bakers %>% distinct(series)
  • bakers %>% count(series)
  • bakers %>% group_by(series) %>% summarize(n = n())
  • bakers %>% count(aired_us, series)
  • bakers %>% count(aired_us, series) %>% mutate(prop_bakers = n/sum(n))
  • bakers %>% group_by(aired_us, series) %>% summarize(n = n()) %>% mutate(prop_bakers = n/sum(n))
  • bakers %>% count(aired_us, series) %>% count(aired_us)
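The snippets above differ in a subtle way: count() returns an ungrouped tibble, so n/sum(n) divides by the grand total, while summarize() keeps the outermost grouping, so n/sum(n) divides within each aired_us group. A minimal sketch with hypothetical toy data (not the real bakers tibble):

```r
library(dplyr)

# Hypothetical toy data standing in for bakers
df <- tibble(aired_us = c(TRUE, TRUE, TRUE, FALSE),
             series   = c(1, 1, 2, 2))

# count() returns an ungrouped tibble: sum(n) is the grand total (4)
overall <- df %>%
  count(aired_us, series) %>%
  mutate(prop_bakers = n / sum(n))

# summarize() drops only the innermost grouping level, leaving the result
# grouped by aired_us: sum(n) is the per-group total
within <- df %>%
  group_by(aired_us, series) %>%
  summarize(n = n()) %>%
  mutate(prop_bakers = n / sum(n))
```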

Example code includes:

# Read in "bakeoff.csv" as bakeoff
bakeoff <- readr::read_csv("./RInputFiles/bakeoff.csv")
## Parsed with column specification:
## cols(
##   series = col_double(),
##   episode = col_double(),
##   baker = col_character(),
##   signature = col_character(),
##   technical = col_double(),
##   showstopper = col_character(),
##   result = col_character(),
##   uk_airdate = col_date(format = ""),
##   us_season = col_double(),
##   us_airdate = col_date(format = "")
## )
# Print bakeoff
bakeoff
## # A tibble: 549 x 10
##    series episode baker signature technical showstopper result uk_airdate
##     <dbl>   <dbl> <chr> <chr>         <dbl> <chr>       <chr>  <date>    
##  1      1       1 Anne~ "Light J~         2 Chocolate ~ IN     2010-08-17
##  2      1       1 David Chocolat~         3 "Black For~ IN     2010-08-17
##  3      1       1 Edd   Caramel ~         1 <NA>        IN     2010-08-17
##  4      1       1 Jasm~ Fresh Ma~        NA <NA>        IN     2010-08-17
##  5      1       1 Jona~ Carrot C~         9 Three-laye~ IN     2010-08-17
##  6      1       1 Loui~ Carrot a~        NA "Never Fai~ IN     2010-08-17
##  7      1       1 Mira~ "Triple ~         8 "Three Tie~ IN     2010-08-17
##  8      1       1 Ruth  "Lemon D~        NA "Classic C~ IN     2010-08-17
##  9      1       1 Lea   "Cranber~        10 "Chocolate~ OUT    2010-08-17
## 10      1       1 Mark  Sticky M~        NA Heart-shap~ OUT    2010-08-17
## # ... with 539 more rows, and 2 more variables: us_season <dbl>,
## #   us_airdate <date>
# Data set above is already OK - UNKNOWN are NA in CSV
# Filter rows where showstopper is UNKNOWN 
bakeoff %>% 
    filter(showstopper == "UNKNOWN")
## # A tibble: 0 x 10
## # ... with 10 variables: series <dbl>, episode <dbl>, baker <chr>,
## #   signature <chr>, technical <dbl>, showstopper <chr>, result <chr>,
## #   uk_airdate <date>, us_season <dbl>, us_airdate <date>
# Edit to add list of missing values
bakeoff <- read_csv("./RInputFiles/bakeoff.csv", na = c("", "NA", "UNKNOWN"))
## Parsed with column specification:
## cols(
##   series = col_double(),
##   episode = col_double(),
##   baker = col_character(),
##   signature = col_character(),
##   technical = col_double(),
##   showstopper = col_character(),
##   result = col_character(),
##   uk_airdate = col_date(format = ""),
##   us_season = col_double(),
##   us_airdate = col_date(format = "")
## )
# Filter rows where showstopper is NA 
bakeoff %>% 
    filter(is.na(showstopper))
## # A tibble: 21 x 10
##    series episode baker signature technical showstopper result uk_airdate
##     <dbl>   <dbl> <chr> <chr>         <dbl> <chr>       <chr>  <date>    
##  1      1       1 Edd   Caramel ~         1 <NA>        IN     2010-08-17
##  2      1       1 Jasm~ Fresh Ma~        NA <NA>        IN     2010-08-17
##  3      1       6 Mira~ Lemon Cu~        NA <NA>        RUNNE~ 2010-09-21
##  4      2       1 Ian   Apple an~        10 <NA>        IN     2011-08-16
##  5      2       1 Jason "Lemon M~         6 <NA>        IN     2011-08-16
##  6      2       1 Urva~ Cherry B~         7 <NA>        IN     2011-08-16
##  7      2       1 Yasm~ Cardamom~         5 <NA>        IN     2011-08-16
##  8      2       1 Holly "Cherry ~         1 <NA>        SB     2011-08-16
##  9      2       2 Ben   Chorizo,~         1 <NA>        IN     2011-08-23
## 10      2       2 Ian   "Stilton~         2 <NA>        IN     2011-08-23
## # ... with 11 more rows, and 2 more variables: us_season <dbl>,
## #   us_airdate <date>
# Edit to filter, group by, and skim
bakeoff %>% 
  filter(!is.na(us_season)) %>% 
  group_by(us_season) %>%
  skimr::skim()
## Skim summary statistics
##  n obs: 302 
##  n variables: 10 
##  group variables: us_season 
## 
## -- Variable type:character ---------------------------------------------------------------------------------------
##  us_season    variable missing complete  n min max empty n_unique
##          1       baker       0       78 78   3   9     0       13
##          1      result       0       78 78   2   9     0        5
##          1 showstopper       0       78 78   5 126     0       78
##          1   signature       0       78 78  10 125     0       78
##          2       baker       0       74 74   4   7     0       12
##          2      result       0       74 74   2   9     0        6
##          2 showstopper       1       73 74   8  82     0       73
##          2   signature       1       73 74  15 107     0       73
##          3       baker       0       75 75   3   6     0       12
##          3      result       0       75 75   2   9     0        5
##          3 showstopper       0       75 75  10  70     0       73
##          3   signature       0       75 75  12  64     0       74
##          4       baker       0       75 75   3   9     0       12
##          4      result       0       75 75   2   9     0        5
##          4 showstopper       0       75 75   5  86     0       74
##          4   signature       0       75 75  12  93     0       75
## 
## -- Variable type:Date --------------------------------------------------------------------------------------------
##  us_season   variable missing complete  n        min        max     median
##          1 uk_airdate       0       78 78 2013-08-20 2013-10-22 2013-09-10
##          1 us_airdate       0       78 78 2014-12-28 2015-03-01 2015-01-18
##          2 uk_airdate       0       74 74 2014-08-06 2014-10-08 2014-08-27
##          2 us_airdate       0       74 74 2015-09-06 2015-11-08 2015-09-27
##          3 uk_airdate       0       75 75 2015-08-05 2015-10-07 2015-08-26
##          3 us_airdate       0       75 75 2016-07-01 2016-08-12 2016-07-15
##          4 uk_airdate       0       75 75 2016-08-24 2016-10-26 2016-09-14
##          4 us_airdate       0       75 75 2017-06-16 2017-08-04 2017-06-30
##  n_unique
##        10
##        10
##        10
##        10
##        10
##         7
##        10
##         8
## 
## -- Variable type:numeric -----------------------------------------------------------------------------------------
##  us_season  variable missing complete  n mean   sd p0  p25 p50 p75 p100
##          1   episode       0       78 78 4.31 2.66  1 2    4     6   10
##          1    series       0       78 78 4    0     4 4    4     4    4
##          1 technical       0       78 78 5.08 3.19  1 2.25 4.5   7   13
##          2   episode       0       74 74 4.38 2.68  1 2    4     6   10
##          2    series       0       74 74 5    0     5 5    5     5    5
##          2 technical       1       73 74 4.73 2.93  1 2    4     7   12
##          3   episode       0       75 75 4.4  2.67  1 2    4     6   10
##          3    series       0       75 75 6    0     6 6    6     6    6
##          3 technical       0       75 75 4.8  2.92  1 2    4     7   12
##          4   episode       0       75 75 4.4  2.67  1 2    4     6   10
##          4    series       0       75 75 7    0     7 7    7     7    7
##          4 technical       0       75 75 4.8  2.92  1 2    4     7   12
##      hist
##  ▇▃▃▂▂▂▂▂
##  ▁▁▁▇▁▁▁▁
##  ▇▇▃▅▂▃▁▁
##  ▇▃▃▃▂▂▂▂
##  ▁▁▁▇▁▁▁▁
##  ▇▃▇▃▂▃▁▁
##  ▇▃▃▃▂▂▂▂
##  ▁▁▁▇▁▁▁▁
##  ▇▃▇▃▂▃▁▁
##  ▇▃▃▃▂▂▂▂
##  ▁▁▁▇▁▁▁▁
##  ▇▃▇▃▂▃▁▁
bakeoff %>% 
  distinct(result)
## # A tibble: 6 x 1
##   result   
##   <chr>    
## 1 IN       
## 2 OUT      
## 3 RUNNER UP
## 4 WINNER   
## 5 SB       
## 6 LEFT
# Count rows by distinct results
bakeoff %>% 
  count(result)
## # A tibble: 6 x 2
##   result        n
##   <chr>     <int>
## 1 IN          393
## 2 LEFT          1
## 3 OUT          70
## 4 RUNNER UP    16
## 5 SB           61
## 6 WINNER        8
# Count whether or not star baker
bakeoff %>% 
  count(result=="SB")
## # A tibble: 2 x 2
##   `result == "SB"`     n
##   <lgl>            <int>
## 1 FALSE              488
## 2 TRUE                61
# Count the number of rows by series and episode
bakeoff %>%
  count(series, episode)
## # A tibble: 74 x 3
##    series episode     n
##     <dbl>   <dbl> <int>
##  1      1       1    10
##  2      1       2     8
##  3      1       3     6
##  4      1       4     5
##  5      1       5     4
##  6      1       6     3
##  7      2       1    12
##  8      2       2    11
##  9      2       3    10
## 10      2       4     8
## # ... with 64 more rows
# Add second count by series
bakeoff %>% 
  count(series, episode) %>%
  count(series)
## # A tibble: 8 x 2
##   series     n
##    <dbl> <int>
## 1      1     6
## 2      2     8
## 3      3    10
## 4      4    10
## 5      5    10
## 6      6    10
## 7      7    10
## 8      8    10
# Count the number of rows by series and baker
bakers_by_series <- 
  bakeoff %>%
  count(series, baker)

# Print to view
bakers_by_series
## # A tibble: 95 x 3
##    series baker         n
##     <dbl> <chr>     <int>
##  1      1 Annetha       2
##  2      1 David         4
##  3      1 Edd           6
##  4      1 Jasminder     5
##  5      1 Jonathan      3
##  6      1 Lea           1
##  7      1 Louise        2
##  8      1 Mark          1
##  9      1 Miranda       6
## 10      1 Ruth          6
## # ... with 85 more rows
# Count again by series
bakers_by_series %>%
  count(series)
## # A tibble: 8 x 2
##   series     n
##    <dbl> <int>
## 1      1    10
## 2      2    12
## 3      3    12
## 4      4    13
## 5      5    12
## 6      6    12
## 7      7    12
## 8      8    12
# Count again by baker
bakers_by_series %>%
  count(baker, sort=TRUE)
## # A tibble: 86 x 2
##    baker      n
##    <chr>  <int>
##  1 Kate       3
##  2 Ian        2
##  3 James      2
##  4 Louise     2
##  5 Mark       2
##  6 Peter      2
##  7 Robert     2
##  8 Tom        2
##  9 Ali        1
## 10 Alvin      1
## # ... with 76 more rows
ggplot(bakeoff, aes(x=episode)) + 
    geom_bar() + 
    facet_wrap(~series)


Chapter 2 - Tame Data

Cast column types:

  • Type-casting can be an important step in taming data
  • The readr package has a col_types argument within the read_csv() function for specifying column types
    • By default, all of the column types are guessed from the first 1,000 rows
    • bakers_raw %>% dplyr::slice(1:4) # look at the first 4 rows
  • Can convert a character to a number using parse_number()
    • parse_number(“36 years”) # will become 36
    • bakers_tame <- read_csv(file = “bakers.csv”, col_types = cols(age = col_number()) ) # col_number() will wrangle the age column to a numeric
  • Can also use the parse_date capability to manage date inputs
    • parse_date(“14 August 2012”, format = “%d %B %Y”)
    • bakers <- read_csv(“bakers.csv”, col_types = cols( last_date_uk = col_date(format = “%d %B %Y”) )) # col_date() will wrangle last_date_uk to a date
  • There is always both a parse_* and a col_* for any given data type; can practice with parse_* then use col_* in the read-in
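A small sketch of the parse_*/col_* pairing, using an inline CSV string (a hypothetical stand-in for bakers.csv) so it runs without the file:

```r
library(readr)

# parse_* works interactively on character vectors...
parse_number("36 years")                           # 36
parse_date("14 August 2012", format = "%d %B %Y")  # Date 2012-08-14

# ...then the matching col_* goes into read_csv()
csv_text <- "baker,age,last_date_uk\nEdd,24 years,21 September 2010"
bakers_tame <- read_csv(csv_text, col_types = cols(
  age          = col_number(),
  last_date_uk = col_date(format = "%d %B %Y")
))
```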

Recode values:

  • The recode() function in dplyr can be used to recode values in the data
    • young_bakers %>% mutate(stu_label = recode(student, `0` = “other”, .default = “student”)) # 0 will become other, anything else will become student; numeric match values need backticks
    • young_bakers %>% mutate(stu_label = recode(student, `0` = NA_character_, .default = “student”)) # use NA_character_ to recode a specific value to NA
    • young_bakers %>% mutate(student = na_if(student, 0)) # na_if() converts values matching the second argument to NA
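A runnable sketch of the recoding bullets, with young_bakers as a hypothetical two-row stand-in; note the backticks around the numeric match value, and that an NA replacement must match the output type (NA_character_ when the other replacements are character):

```r
library(dplyr)

# Hypothetical stand-in for young_bakers
young_bakers <- tibble(baker = c("Ali", "Ben"), student = c(0, 1))

# recode(): named replacements, numeric matches in backticks
labeled <- young_bakers %>%
  mutate(stu_label = recode(student, `0` = "other", .default = "student"))

# na_if(): replace a single matching value with NA
blanked <- young_bakers %>%
  mutate(student = na_if(student, 0))
```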

Select variables:

  • Can select just a subset of the variables using select
  • The select() function is powerful when you only need to work with a subset of the data
    • young_bakers2 %>% select(baker, series_winner) # keep these variables
    • young_bakers2 %>% select(-technical_winner) # drop these variables (signalled by the minus sign)
  • Can use helper functions inside the select() call
    • young_bakers2 %>% select(baker, starts_with(“series”))
    • young_bakers2 %>% select(ends_with(“winner”), baker)
    • young_bakers2 %>% select(contains(“bake”))
  • The filter() function works on rows rather than columns
    • young_bakers2 %>% filter(series_winner == 1 | series_runner_up == 1)
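A minimal sketch of the select() and filter() bullets, with young_bakers2 as a hypothetical one-row stand-in:

```r
library(dplyr)

# Hypothetical stand-in for young_bakers2
young_bakers2 <- tibble(baker = "Edd", series_winner = 1,
                        series_runner_up = 0, technical_winner = 0)

young_bakers2 %>% select(baker, starts_with("series"))  # keep by prefix
young_bakers2 %>% select(-technical_winner)             # drop one column
young_bakers2 %>%                                       # filter acts on rows
  filter(series_winner == 1 | series_runner_up == 1)
```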

Tame variable names:

  • Can rename variables while selecting
    • young_bakers3 %>% select(baker, tech_1 = tre1)
    • young_bakers3 %>% select(baker, tech_ = tre1:tre3)
    • young_bakers3 %>% select(baker, tech_ = starts_with(“tr”), result_ = starts_with(“rs”))
  • Within the rename call, it is not possible to use the helper functions
    • young_bakers3 %>% rename(tech_1 = t_first, result_1 = r_first) # new = old
    • young_bakers3 %>% select(everything(), tech_ = starts_with(“tr”), result_ = starts_with(“rs”)) # everything first keeps all the column orders the same
  • Can also use the janitor package to help with cleaning variable names
    • young_bakers3 %>% janitor::clean_names()
    • Converts to snake case (lower case with underscores)
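A sketch of renaming during select() versus rename(), using a hypothetical young_bakers3 stand-in (janitor::clean_names() is omitted so the example runs with dplyr alone):

```r
library(dplyr)

# Hypothetical stand-in for young_bakers3
young_bakers3 <- tibble(baker = "Edd", tre1 = 2, tre2 = 1, tre3 = 3)

# select() can rename via helpers; one new name spread over several
# columns gets numbered automatically (tech_1, tech_2, ...)
with_helpers <- young_bakers3 %>% select(baker, tech_ = starts_with("tr"))

# rename() takes explicit new = old pairs and keeps all other columns
renamed <- young_bakers3 %>% rename(tech_1 = tre1)
```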

Example code includes:

# NOTE THAT THIS WILL THROW WARNINGS
# Try to cast technical as a number
desserts <- readr::read_csv("./RInputFiles/desserts.csv",
                      col_types = cols(
                        technical = col_number())
                     )
## Warning: 7 parsing failures.
## row       col expected actual                         file
##   4 technical a number    N/A './RInputFiles/desserts.csv'
##   6 technical a number    N/A './RInputFiles/desserts.csv'
##   8 technical a number    N/A './RInputFiles/desserts.csv'
##  10 technical a number    N/A './RInputFiles/desserts.csv'
##  34 technical a number    N/A './RInputFiles/desserts.csv'
## ... ......... ........ ...... ............................
## See problems(...) for more details.
# View parsing problems
readr::problems(desserts)
## # A tibble: 7 x 5
##     row col       expected actual file                        
##   <int> <chr>     <chr>    <chr>  <chr>                       
## 1     4 technical a number N/A    './RInputFiles/desserts.csv'
## 2     6 technical a number N/A    './RInputFiles/desserts.csv'
## 3     8 technical a number N/A    './RInputFiles/desserts.csv'
## 4    10 technical a number N/A    './RInputFiles/desserts.csv'
## 5    34 technical a number N/A    './RInputFiles/desserts.csv'
## 6    35 technical a number N/A    './RInputFiles/desserts.csv'
## 7    36 technical a number N/A    './RInputFiles/desserts.csv'
# NOTE THAT THIS WILL FIX THE ERRORS
# Edit code to fix the parsing error 
desserts <- readr::read_csv("./RInputFiles/desserts.csv",
                      col_types = cols(
                        technical = col_number()),
                      na = c("", "NA", "N/A") 
                     )

# View parsing problems
readr::problems(desserts)
## [1] row      col      expected actual  
## <0 rows> (or 0-length row.names)
# Find format to parse uk_airdate 
readr::parse_date("17 August 2010", format = "%d %B %Y")
## [1] "2010-08-17"
# Edit to cast uk_airdate
desserts <- readr::read_csv("./RInputFiles/desserts.csv", 
                     na = c("", "NA", "N/A"),
                     col_types = cols(
                       technical = col_number(),
                       uk_airdate = col_date("%d %B %Y")
                     ))

# Print by descending uk_airdate
desserts %>%
  arrange(desc(uk_airdate))
## # A tibble: 549 x 16
##    series episode baker technical result uk_airdate us_season us_airdate
##     <dbl>   <dbl> <chr>     <dbl> <chr>  <date>         <dbl> <date>    
##  1      8      10 Kate          3 RUNNE~ 2017-10-31        NA NA        
##  2      8      10 Stev~         1 RUNNE~ 2017-10-31        NA NA        
##  3      8      10 Soph~         2 WINNER 2017-10-31        NA NA        
##  4      8       9 Kate          4 IN     2017-10-24        NA NA        
##  5      8       9 Stev~         3 IN     2017-10-24        NA NA        
##  6      8       9 Stac~         2 OUT    2017-10-24        NA NA        
##  7      8       9 Soph~         1 SB     2017-10-24        NA NA        
##  8      8       8 Kate          2 IN     2017-10-17        NA NA        
##  9      8       8 Soph~         4 IN     2017-10-17        NA NA        
## 10      8       8 Stev~         1 IN     2017-10-17        NA NA        
## # ... with 539 more rows, and 8 more variables:
## #   showstopper_chocolate <chr>, showstopper_dessert <chr>,
## #   showstopper_fruit <chr>, showstopper_nut <chr>,
## #   signature_chocolate <chr>, signature_dessert <chr>,
## #   signature_fruit <chr>, signature_nut <chr>
# Cast result as a factor
desserts <- readr::read_csv("./RInputFiles/desserts.csv", 
                     na = c("", "NA", "N/A"),
                     col_types = cols(
                       technical = col_number(),
                       uk_airdate = col_date(format = "%d %B %Y"),
                       result = col_factor(levels=NULL)
                     ))
                    
# Glimpse to view
glimpse(desserts)
## Observations: 549
## Variables: 16
## $ series                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ episode               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2...
## $ baker                 <chr> "Annetha", "David", "Edd", "Jasminder", ...
## $ technical             <dbl> 2, 3, 1, NA, 9, NA, 8, NA, 10, NA, 8, 6,...
## $ result                <fct> IN, IN, IN, IN, IN, IN, IN, IN, OUT, OUT...
## $ uk_airdate            <date> 2010-08-17, 2010-08-17, 2010-08-17, 201...
## $ us_season             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ us_airdate            <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate"...
## $ showstopper_dessert   <chr> "other", "other", "other", "other", "oth...
## $ showstopper_fruit     <chr> "no fruit", "no fruit", "no fruit", "no ...
## $ showstopper_nut       <chr> "no nut", "no nut", "no nut", "no nut", ...
## $ signature_chocolate   <chr> "no chocolate", "chocolate", "no chocola...
## $ signature_dessert     <chr> "cake", "cake", "cake", "cake", "cake", ...
## $ signature_fruit       <chr> "no fruit", "fruit", "fruit", "fruit", "...
## $ signature_nut         <chr> "no nut", "no nut", "no nut", "no nut", ...
oldDesserts <- desserts
tempDesserts <- desserts %>%
    gather(key="type_ing", value="status", starts_with("showstopper"), starts_with("signature")) %>%
    separate(type_ing, into=c("challenge", "ingredient"), sep="_") %>%
    spread(ingredient, status)
glimpse(tempDesserts)
## Observations: 1,098
## Variables: 13
## $ series     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ episode    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ baker      <chr> "Annetha", "David", "Edd", "Jasminder", "Jonathan",...
## $ technical  <dbl> 2, 3, 1, NA, 9, NA, 8, NA, 10, NA, 8, 6, 2, 1, 3, 5...
## $ result     <fct> IN, IN, IN, IN, IN, IN, IN, IN, OUT, OUT, IN, IN, I...
## $ uk_airdate <date> 2010-08-17, 2010-08-17, 2010-08-17, 2010-08-17, 20...
## $ us_season  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ us_airdate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ challenge  <chr> "showstopper", "showstopper", "showstopper", "shows...
## $ chocolate  <chr> "chocolate", "chocolate", "no chocolate", "no choco...
## $ dessert    <chr> "other", "other", "other", "other", "other", "cake"...
## $ fruit      <chr> "no fruit", "no fruit", "no fruit", "no fruit", "fr...
## $ nut        <chr> "no nut", "no nut", "no nut", "no nut", "almond", "...
desserts <- tempDesserts


# Count rows grouping by nut variable
desserts %>%
  count(nut, sort=TRUE)
## # A tibble: 8 x 2
##   nut           n
##   <chr>     <int>
## 1 no nut      944
## 2 almond       35
## 3 walnut       35
## 4 pistachio    29
## 5 filbert      23
## 6 pecan        14
## 7 multiple      9
## 8 peanut        9
# Recode filberts as hazelnuts
desserts <- desserts %>% 
  mutate(nut = recode(nut, "filbert" = "hazelnut"))

# Count rows again 
desserts %>% 
    count(nut, sort = TRUE)
## # A tibble: 8 x 2
##   nut           n
##   <chr>     <int>
## 1 no nut      944
## 2 almond       35
## 3 walnut       35
## 4 pistachio    29
## 5 hazelnut     23
## 6 pecan        14
## 7 multiple      9
## 8 peanut        9
# Edit code to recode "no nut" as missing
desserts <- desserts %>% 
  mutate(nut = recode(nut, "filbert" = "hazelnut", 
                           "no nut" = NA_character_))

# Count rows again 
desserts %>% 
    count(nut, sort = TRUE)
## # A tibble: 8 x 2
##   nut           n
##   <chr>     <int>
## 1 <NA>        944
## 2 almond       35
## 3 walnut       35
## 4 pistachio    29
## 5 hazelnut     23
## 6 pecan        14
## 7 multiple      9
## 8 peanut        9
# Edit to recode tech_win as factor
desserts <- desserts %>% 
  mutate(tech_win = recode_factor(technical, `1` = 1,
                           .default = 0))

# Count to compare values                      
desserts %>% 
  count(technical == 1, tech_win)
## Warning: Factor `tech_win` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 3
##   `technical == 1` tech_win     n
##   <lgl>            <fct>    <int>
## 1 FALSE            0          936
## 2 TRUE             1          146
## 3 NA               <NA>        16
ratings0 <- readr::read_csv("./RInputFiles/02.03_messy_ratings.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   premiere = col_character(),
##   finale = col_character(),
##   winner = col_character(),
##   day_of_week = col_character(),
##   timeslot = col_time(format = ""),
##   channel = col_character(),
##   runner_up_1 = col_character(),
##   runner_up_2 = col_character(),
##   season_premiere = col_character(),
##   season_finale = col_character(),
##   e1_uk_airdate = col_character(),
##   e2_uk_airdate = col_character(),
##   e3_uk_airdate = col_character(),
##   e4_uk_airdate = col_character(),
##   e5_uk_airdate = col_character(),
##   e6_uk_airdate = col_character(),
##   e7_uk_airdate = col_character(),
##   e8_uk_airdate = col_character(),
##   e9_uk_airdate = col_character(),
##   e10_uk_airdate = col_character()
## )
## See spec(...) for full column specifications.
str(ratings0, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 8 obs. of  44 variables:
##  $ series           : num  1 2 3 4 5 6 7 8
##  $ episodes         : num  6 8 10 10 10 10 10 10
##  $ premiere         : chr  "17-Aug-10" "14-Aug-11" "14-Aug-12" "20-Aug-13" ...
##  $ finale           : chr  "21-Sep-10" "4-Oct-11" "16-Oct-12" "22-Oct-13" ...
##  $ winner           : chr  "Edd Kimber" "Joanne Wheatley" "John Whaite" "Frances Quinn" ...
##  $ avg_uk_viewers   : num  2.77 4 5 7.35 10.04 ...
##  $ day_of_week      : chr  "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ timeslot         : 'hms' num  20:00:00 20:00:00 20:00:00 20:00:00 ...
##  $ channel          : chr  "BBC Two" "BBC Two" "BBC Two" "BBC Two" ...
##  $ runner_up_1      : chr  "Miranda Gore Browne" "Holly Bell" "Brendan Lynch" "Kimberley Wilson" ...
##  $ runner_up_2      : chr  "Ruth Clemens" "Mary-Anne Boermans" "James Morton" "Ruby Tandoh" ...
##  $ season           : num  NA NA NA 1 2 3 4 NA
##  $ season_premiere  : chr  NA NA NA "12/28/14" ...
##  $ season_finale    : chr  NA NA NA "3/1/15" ...
##  $ e1_viewers_7day  : num  2.24 3.1 3.85 6.6 8.51 ...
##  $ e1_viewers_28day : num  NA NA NA NA NA ...
##  $ e2_viewers_7day  : num  3 3.53 4.6 6.65 8.79 ...
##  $ e2_viewers_28day : num  NA NA NA NA NA ...
##  $ e3_viewers_7day  : num  3 3.82 4.53 7.17 9.28 ...
##  $ e3_viewers_28day : num  NA NA NA NA NA ...
##  $ e4_viewers_7day  : num  2.6 3.6 4.71 6.82 10.25 ...
##  $ e4_viewers_28day : num  NA NA NA NA NA ...
##  $ e5_viewers_7day  : num  3.03 3.83 4.61 6.95 9.95 ...
##  $ e5_viewers_28day : num  NA NA NA NA NA ...
##  $ e6_viewers_7day  : num  2.75 4.25 4.82 7.32 10.13 ...
##  $ e6_viewers_28day : num  NA NA NA NA NA ...
##  $ e7_viewers_7day  : num  NA 4.42 5.1 7.76 10.28 ...
##  $ e7_viewers_28day : num  NA NA NA NA NA ...
##  $ e8_viewers_7day  : num  NA 5.06 5.35 7.41 9.02 ...
##  $ e8_viewers_28day : num  NA NA NA NA NA ...
##  $ e9_viewers_7day  : num  NA NA 5.7 7.41 10.67 ...
##  $ e9_viewers_28day : num  NA NA NA NA NA ...
##  $ e10_viewers_7day : num  NA NA 6.74 9.45 13.51 ...
##  $ e10_viewers_28day: num  NA NA NA NA NA ...
##  $ e1_uk_airdate    : chr  "8/17/10" "8/16/11" "8/14/12" "8/20/13" ...
##  $ e2_uk_airdate    : chr  "8/24/10" "8/23/11" "8/21/12" "8/27/13" ...
##  $ e3_uk_airdate    : chr  "8/31/10" "8/30/11" "8/28/12" "9/3/13" ...
##  $ e4_uk_airdate    : chr  "9/7/10" "9/6/11" "9/4/12" "9/10/13" ...
##  $ e5_uk_airdate    : chr  "9/14/10" "9/13/11" "9/11/12" "9/17/13" ...
##  $ e6_uk_airdate    : chr  "9/21/10" "9/20/11" "9/18/12" "9/24/13" ...
##  $ e7_uk_airdate    : chr  NA "9/27/11" "9/25/12" "10/1/13" ...
##  $ e8_uk_airdate    : chr  NA "10/4/11" "10/2/12" "10/8/13" ...
##  $ e9_uk_airdate    : chr  NA NA "10/9/12" "10/15/13" ...
##  $ e10_uk_airdate   : chr  NA NA "10/16/12" "10/22/13" ...
ratings <- ratings0 %>%
    filter(series >= 3) %>%
    rename(day=day_of_week) %>%
    mutate(series=factor(series), 
           season_premiere=lubridate::mdy(season_premiere), 
           season_finale=lubridate::mdy(season_finale), 
           viewer_growth = (e10_viewers_7day - e1_viewers_7day)
           ) %>%
    select(-contains("uk_airdate"))


# Recode channel as dummy: bbc (1) or not (0)
ratings <- ratings %>% 
  mutate(bbc = recode_factor(channel, "Channel 4"=0, .default=1))

# Look at the variables to plot next
ratings %>% select(series, channel, bbc, viewer_growth)
## # A tibble: 6 x 4
##   series channel   bbc   viewer_growth
##   <fct>  <chr>     <fct>         <dbl>
## 1 3      BBC Two   1             2.89 
## 2 4      BBC Two   1             2.85 
## 3 5      BBC One   1             5    
## 4 6      BBC One   1             3.43 
## 5 7      BBC One   1             2.32 
## 6 8      Channel 4 0             0.580
# Make a filled bar chart
ggplot(ratings, aes(x = series, y = viewer_growth, fill = bbc)) +
  geom_col()

# Move channel to first column
ratings %>% 
  select(channel, everything())
## # A tibble: 6 x 36
##   channel series episodes premiere finale winner avg_uk_viewers day  
##   <chr>   <fct>     <dbl> <chr>    <chr>  <chr>           <dbl> <chr>
## 1 BBC Two 3            10 14-Aug-~ 16-Oc~ John ~           5    Tues~
## 2 BBC Two 4            10 20-Aug-~ 22-Oc~ Franc~           7.35 Tues~
## 3 BBC One 5            10 6-Aug-14 8-Oct~ Nancy~          10.0  Wedn~
## 4 BBC One 6            10 5-Aug-15 7-Oct~ Nadiy~          12.5  Wedn~
## 5 BBC One 7            10 24-Aug-~ 26-Oc~ Candi~          13.8  Wedn~
## 6 Channe~ 8            10 29-Aug-~ 31-Oc~ Sophi~           9.29 Tues~
## # ... with 28 more variables: timeslot <drtn>, runner_up_1 <chr>,
## #   runner_up_2 <chr>, season <dbl>, season_premiere <date>,
## #   season_finale <date>, e1_viewers_7day <dbl>, e1_viewers_28day <dbl>,
## #   e2_viewers_7day <dbl>, e2_viewers_28day <dbl>, e3_viewers_7day <dbl>,
## #   e3_viewers_28day <dbl>, e4_viewers_7day <dbl>, e4_viewers_28day <dbl>,
## #   e5_viewers_7day <dbl>, e5_viewers_28day <dbl>, e6_viewers_7day <dbl>,
## #   e6_viewers_28day <dbl>, e7_viewers_7day <dbl>, e7_viewers_28day <dbl>,
## #   e8_viewers_7day <dbl>, e8_viewers_28day <dbl>, e9_viewers_7day <dbl>,
## #   e9_viewers_28day <dbl>, e10_viewers_7day <dbl>,
## #   e10_viewers_28day <dbl>, viewer_growth <dbl>, bbc <fct>
# Edit to drop 7- and 28-day episode viewer data
ratings %>% 
  select(-ends_with("day"))
## # A tibble: 6 x 15
##   series episodes premiere finale winner avg_uk_viewers timeslot channel
##   <fct>     <dbl> <chr>    <chr>  <chr>           <dbl> <drtn>   <chr>  
## 1 3            10 14-Aug-~ 16-Oc~ John ~           5    20:00    BBC Two
## 2 4            10 20-Aug-~ 22-Oc~ Franc~           7.35 20:00    BBC Two
## 3 5            10 6-Aug-14 8-Oct~ Nancy~          10.0  20:00    BBC One
## 4 6            10 5-Aug-15 7-Oct~ Nadiy~          12.5  20:00    BBC One
## 5 7            10 24-Aug-~ 26-Oc~ Candi~          13.8  20:00    BBC One
## 6 8            10 29-Aug-~ 31-Oc~ Sophi~           9.29 20:00    Channe~
## # ... with 7 more variables: runner_up_1 <chr>, runner_up_2 <chr>,
## #   season <dbl>, season_premiere <date>, season_finale <date>,
## #   viewer_growth <dbl>, bbc <fct>
# Edit to move channel to first and drop episode viewer data
ratings %>% 
  select(-ends_with("day")) %>%
  select(channel, everything())
## # A tibble: 6 x 15
##   channel series episodes premiere finale winner avg_uk_viewers timeslot
##   <chr>   <fct>     <dbl> <chr>    <chr>  <chr>           <dbl> <drtn>  
## 1 BBC Two 3            10 14-Aug-~ 16-Oc~ John ~           5    20:00   
## 2 BBC Two 4            10 20-Aug-~ 22-Oc~ Franc~           7.35 20:00   
## 3 BBC One 5            10 6-Aug-14 8-Oct~ Nancy~          10.0  20:00   
## 4 BBC One 6            10 5-Aug-15 7-Oct~ Nadiy~          12.5  20:00   
## 5 BBC One 7            10 24-Aug-~ 26-Oc~ Candi~          13.8  20:00   
## 6 Channe~ 8            10 29-Aug-~ 31-Oc~ Sophi~           9.29 20:00   
## # ... with 7 more variables: runner_up_1 <chr>, runner_up_2 <chr>,
## #   season <dbl>, season_premiere <date>, season_finale <date>,
## #   viewer_growth <dbl>, bbc <fct>
# Glimpse messy names
# glimpse(messy_ratings)

# Reformat to lower camelcase
# ratings <- messy_ratings %>%
#   clean_names(case="lower_camel")
    
# Glimpse cleaned names
# glimpse(ratings)

# Reformat to snake case
# ratings <- messy_ratings %>% 
#     clean_names("snake")

# Glimpse cleaned names
# glimpse(ratings)


# Select 7-day viewer data by series
viewers_7day <- ratings %>%
  select(series, contains("7day"))

# Glimpse
glimpse(viewers_7day)
## Observations: 6
## Variables: 11
## $ series           <fct> 3, 4, 5, 6, 7, 8
## $ e1_viewers_7day  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ e2_viewers_7day  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ e3_viewers_7day  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ e4_viewers_7day  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ e5_viewers_7day  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ e6_viewers_7day  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ e7_viewers_7day  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ e8_viewers_7day  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ e9_viewers_7day  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ e10_viewers_7day <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
# Adapt code to also rename 7-day viewer data
viewers_7day <- ratings %>% 
    select(series, viewers_7day_ = ends_with("7day"))

# Glimpse
glimpse(viewers_7day)
## Observations: 6
## Variables: 11
## $ series          <fct> 3, 4, 5, 6, 7, 8
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
# Adapt code to drop 28-day columns; move 7-day to front
viewers_7day <- ratings %>% 
    select(viewers_7day_ = ends_with("7day"), everything(), -contains("28day"))

# Glimpse
glimpse(viewers_7day)
## Observations: 6
## Variables: 26
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ series          <fct> 3, 4, 5, 6, 7, 8
## $ episodes        <dbl> 10, 10, 10, 10, 10, 10
## $ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-1...
## $ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-1...
## $ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhi...
## $ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day             <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday"...
## $ timeslot        <drtn> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00...
## $ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "B...
## $ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Tro...
## $ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr",...
## $ season          <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <date> NA, 2014-12-28, 2015-09-06, 2016-07-01, 2017-...
## $ season_finale   <date> NA, 2015-03-01, 2015-11-08, 2016-08-12, 2017-...
## $ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc             <fct> 1, 1, 1, 1, 1, 0
# Adapt code to keep original order
viewers_7day <- ratings %>% 
    select(everything(), -ends_with("28day"), viewers_7day_ = ends_with("7day"))

# Glimpse
glimpse(viewers_7day)
## Observations: 6
## Variables: 26
## $ series          <fct> 3, 4, 5, 6, 7, 8
## $ episodes        <dbl> 10, 10, 10, 10, 10, 10
## $ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-1...
## $ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-1...
## $ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhi...
## $ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day             <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday"...
## $ timeslot        <drtn> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00...
## $ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "B...
## $ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Tro...
## $ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr",...
## $ season          <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <date> NA, 2014-12-28, 2015-09-06, 2016-07-01, 2017-...
## $ season_finale   <date> NA, 2015-03-01, 2015-11-08, 2016-08-12, 2017-...
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc             <fct> 1, 1, 1, 1, 1, 0

Chapter 3 - Tidy Your Data

Introduction to Tidy Data:

  • Tidy data helps with producing good plots - allows for faceting and the like
  • Data can be tidy but not tame, and can be tame but not tidy
    • In general, tidy data is long rather than wide
    • As a result, tidy data tends to take up more space, but with the advantage of being easier to plot or analyze
  • Can automatically get counts summed to a specific level
    • juniors_tidy %>% count(baker, wt = correct) # with wt, the n column is the sum of correct per baker
    • ggplot(juniors_tidy, aes(baker, correct)) + geom_col() # roughly the equivalent when plotting the data

Gather:

  • Gathering is the process of converting data from wide to long
    • gather(data, key, value, ...)
    • key is the new column containing the variable names
    • value is the new column containing the values
    • The ... are the columns to be gathered, with each column name going to the key column and its associated values going to the value column
    • The key and value need to be quoted, while the ... can be passed bare (unquoted)
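
A minimal sketch of gather() (hypothetical wide tibble, using the pre-1.0 tidyr interface taught in the course):

```r
library(tidyr)

# Hypothetical wide data: one column per episode
ratings_wide <- tibble::tibble(
  series = c("1", "2"),
  e1 = c(2.24, 3.10),
  e2 = c(3.00, 3.53)
)

# key and value are quoted; the columns to gather (e1:e2) are passed bare
gather(ratings_wide, key = "episode", value = "viewers", e1:e2)
# 4 rows (2 series x 2 episodes) with columns series, episode, viewers
```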

Separate:

  • Sometimes, a column really contains two variables, for example a combined column such as spice_trail
  • The separate function requires at least three arguments
    • data - the data frame
    • col - the column that you want to separate (can be a bare variable name since it already exists in the data frame)
    • into - quoted variables to be created, inside the c() function
    • By default, the existing column col is replaced
    • There is also an option for convert=TRUE where it will try to pick the best variable type (especially helpful when creating numbers)
    • There is also the option for sep, where the defaults for separators can be over-ridden to better match the data
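
The arguments above can be sketched with a hypothetical combined column:

```r
library(tidyr)

# Hypothetical column holding two variables at once
df <- tibble::tibble(about = c("30, single mother", "24, designer"))

# col is bare (it already exists); into is quoted inside c();
# sep overrides the default separator; convert = TRUE casts age to integer
separate(df, about, into = c("age", "occupation"), sep = ", ", convert = TRUE)
```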

Spread:

  • The spread function is designed to convert long data to wide data
    • Spread can be considered a tool to tidy messy rows, where gather is a tool to tidy messy columns
    • data - the data frame
    • key - the key is the column that currently contains what should become the new columns
    • value - value is the column that currently contains what should become the values in the new columns
    • convert=TRUE will help with re-casting variable types (particularly helpful when numbers are being pulled out of a mixed character-number column, which is likely what drove the need to spread)
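
A minimal sketch of spread(), assuming hypothetical long data with mixed values:

```r
library(tidyr)

# Hypothetical long data: the key column holds what should become column names
df <- tibble::tibble(
  baker = c("A", "A", "B", "B"),
  key   = c("score", "guess", "score", "guess"),
  value = c("8", "pie", "6", "tart")
)

# convert = TRUE re-casts score as integer once it sits in its own column
spread(df, key = key, value = value, convert = TRUE)
```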

Tidy multiple sets of data:

  • Sometimes, there are multiple data components to tidy, where the columns need to be fixed in several ways
    • For example, score_1, guess_1, score_2, guess_2
    • Ideal target would be to have trials (1, 2, 3) in one column, and with columns score and guess containing the variables
  • Example code for converting multiple columns simultaneously
    • juniors_multi %>% gather(key = "key", value = "value", score_1:guess_3) %>% separate(key, into = c("var", "order"), convert = TRUE)
    • juniors_multi %>% gather(key = "key", value = "value", score_1:guess_3) %>% separate(key, into = c("var", "order"), convert = TRUE) %>% spread(var, value)
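
A runnable sketch of that gather/separate/spread pipeline, using a hypothetical juniors_multi with two trials:

```r
library(tidyverse)

# Hypothetical data in the shape described above (values kept as character
# so gather does not mix types; parse_number can recover scores afterwards)
juniors_multi <- tibble(
  baker   = c("A", "B"),
  score_1 = c("8", "6"), guess_1 = c("pie", "tart"),
  score_2 = c("7", "9"), guess_2 = c("cake", "bread")
)

juniors_multi %>%
  gather(key = "key", value = "value", score_1:guess_2) %>%
  separate(key, into = c("var", "order"), convert = TRUE) %>%
  spread(var, value)
# one row per baker and order, with separate guess and score columns
```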

Example code includes:

ratings1 <- readr::read_csv("./RInputFiles/messy_ratings.csv")
## Parsed with column specification:
## cols(
##   series = col_double(),
##   e1 = col_double(),
##   e2 = col_double(),
##   e3 = col_double(),
##   e4 = col_double(),
##   e5 = col_double(),
##   e6 = col_double(),
##   e7 = col_double(),
##   e8 = col_double(),
##   e9 = col_double(),
##   e10 = col_double()
## )
oldRatings <- ratings
ratings <- ratings1
ratings1
## # A tibble: 8 x 11
##   series    e1    e2    e3    e4    e5    e6    e7    e8    e9   e10
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1      1  2.24  3     3     2.6   3.03  2.75 NA    NA    NA    NA   
## 2      2  3.1   3.53  3.82  3.6   3.83  4.25  4.42  5.06 NA    NA   
## 3      3  3.85  4.6   4.53  4.71  4.61  4.82  5.1   5.35  5.7   6.74
## 4      4  6.6   6.65  7.17  6.82  6.95  7.32  7.76  7.41  7.41  9.45
## 5      5  8.51  8.79  9.28 10.2   9.95 10.1  10.3   9.02 10.7  13.5 
## 6      6 11.6  11.6  12.0  12.4  12.4  12    12.4  11.1  12.6  15.0 
## 7      7 13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9 
## 8      8  9.46  9.23  8.68  8.55  8.61  8.61  9.01  8.95  9.03 10.0
# Plot of episode 1 viewers by series
ratings %>%
  ggplot(aes(x=series, y=e1)) + 
  geom_bar(stat="identity")

# Adapt code to plot episode 2 viewers by series
ggplot(ratings, aes(x = series, y = e2)) +
    geom_col() 

# Gather and count episodes
tidy_ratings <- ratings %>%
    gather(key = "episode", value = "viewers_7day", -series, 
           factor_key = TRUE, na.rm = TRUE) %>% 
    arrange(series, episode) %>% 
    mutate(episode_count = row_number())

# Plot viewers by episode and series
ggplot(tidy_ratings, aes(x = episode_count, y = viewers_7day, fill = as.factor(series))) +
    geom_col()

ratings2 <- readr::read_csv("./RInputFiles/messy_ratings2.csv")
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
ratings2$series <- as.factor(ratings2$series)
ratings2
## # A tibble: 8 x 21
##   series e1_7day e1_28day e2_7day e2_28day e3_7day e3_28day e4_7day
##   <fct>    <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>
## 1 1         2.24    NA       3       NA       3       NA       2.6 
## 2 2         3.1     NA       3.53    NA       3.82    NA       3.6 
## 3 3         3.85    NA       4.6     NA       4.53    NA       4.71
## 4 4         6.6     NA       6.65    NA       7.17    NA       6.82
## 5 5         8.51    NA       8.79    NA       9.28    NA      10.2 
## 6 6        11.6     11.7    11.6     11.8    12.0     NA      12.4 
## 7 7        13.6     13.9    13.4     13.7    13.0     13.4    13.3 
## 8 8         9.46     9.72    9.23     9.53    8.68     9.06    8.55
## # ... with 13 more variables: e4_28day <dbl>, e5_7day <dbl>,
## #   e5_28day <dbl>, e6_7day <dbl>, e6_28day <dbl>, e7_7day <dbl>,
## #   e7_28day <dbl>, e8_7day <dbl>, e8_28day <dbl>, e9_7day <dbl>,
## #   e9_28day <dbl>, e10_7day <dbl>, e10_28day <dbl>
# Gather 7-day viewers by episode (ratings2 already loaded)
week_ratings <- ratings2  %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE, factor_key = TRUE)
    
# Plot 7-day viewers by episode and series
ggplot(week_ratings, aes(x = episode, y = viewers_7day, group = series)) +
    geom_line() +
    facet_wrap(~series)

# Edit to parse episode number
week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    mutate(episode = parse_number(episode))
    
# Edit your code to color by series and add a theme
ggplot(week_ratings, aes(x = episode, y = viewers_7day, 
                         group = series, color = series)) +
    geom_line() +
    facet_wrap(~series) +
    guides(color = FALSE) +
    theme_minimal() 

week_ratings_dec <- week_ratings %>%
    mutate(viewers_7day=as.character(viewers_7day)) %>%
    separate(viewers_7day, into=c("viewers_millions", "viewers_decimal"), sep="\\.") %>%
    mutate(viewers_decimal=ifelse(is.na(viewers_decimal), ".", paste0(".", viewers_decimal))) %>%
    dplyr::arrange(series, episode)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [9,
## 17, 46].
# Unite series and episode
ratings3 <- week_ratings_dec %>% 
    unite("viewers_7day", viewers_millions, viewers_decimal)

# Print to view
ratings3
## # A tibble: 74 x 3
##    series episode viewers_7day
##    <fct>    <dbl> <chr>       
##  1 1            1 2_.24       
##  2 1            2 3_.         
##  3 1            3 3_.         
##  4 1            4 2_.6        
##  5 1            5 3_.03       
##  6 1            6 2_.75       
##  7 2            1 3_.1        
##  8 2            2 3_.53       
##  9 2            3 3_.82       
## 10 2            4 3_.6        
## # ... with 64 more rows
# Adapt to change the separator
ratings3 <- week_ratings_dec  %>% 
    unite(viewers_7day, viewers_millions, viewers_decimal, sep="")

# Print to view
ratings3
## # A tibble: 74 x 3
##    series episode viewers_7day
##    <fct>    <dbl> <chr>       
##  1 1            1 2.24        
##  2 1            2 3.          
##  3 1            3 3.          
##  4 1            4 2.6         
##  5 1            5 3.03        
##  6 1            6 2.75        
##  7 2            1 3.1         
##  8 2            2 3.53        
##  9 2            3 3.82        
## 10 2            4 3.6         
## # ... with 64 more rows
# Adapt to cast viewers as a number
ratings3 <- week_ratings_dec  %>% 
    unite(viewers_7day, viewers_millions, viewers_decimal, sep="") %>%
    mutate(viewers_7day = parse_number(viewers_7day))

# Print to view
ratings3
## # A tibble: 74 x 3
##    series episode viewers_7day
##    <fct>    <dbl>        <dbl>
##  1 1            1         2.24
##  2 1            2         3   
##  3 1            3         3   
##  4 1            4         2.6 
##  5 1            5         3.03
##  6 1            6         2.75
##  7 2            1         3.1 
##  8 2            2         3.53
##  9 2            3         3.82
## 10 2            4         3.6 
## # ... with 64 more rows
# Create tidy data with 7- and 28-day viewers
tidy_ratings_all <- ratings2 %>%
    gather(episode, viewers, ends_with("day"), na.rm = TRUE) %>% 
    separate(episode, into = c("episode", "days")) %>%  
    mutate(episode = parse_number(episode),
           days = parse_number(days)) 

# Adapt to spread counted values
tidy_ratings_all %>% 
    count(series, days, wt = viewers) %>%
    spread(key=days, value=n, sep="_")
## # A tibble: 8 x 3
##   series days_7 days_28
##   <fct>   <dbl>   <dbl>
## 1 1        16.6    NA  
## 2 2        31.6    NA  
## 3 3        50.0    NA  
## 4 4        73.5    NA  
## 5 5       100.     NA  
## 6 6       123.    113  
## 7 7       136.    138. 
## 8 8        90.2    92.9
# Fill in blanks to get premiere/finale data
tidy_ratings <- ratings %>%
    gather(episode, viewers, -series, na.rm = TRUE) %>%
    mutate(episode = parse_number(episode)) %>% 
    group_by(series) %>% 
    filter(episode == 1 | episode == max(episode)) %>% 
    ungroup()


# Recode first/last episodes
first_last <- tidy_ratings %>% 
  mutate(episode = recode(episode, `1` = "first", .default = "last")) 

# Fill in to make slope chart
ggplot(first_last, aes(x = episode, y = viewers, color = as.factor(series))) +
  geom_point() +
  geom_line(aes(group = series))

# Switch the variables mapping x-axis and color
ggplot(first_last, aes(x = series, y = viewers, color = episode)) +
  geom_point() + # keep
  geom_line(aes(group = series)) + # keep
  coord_flip() # keep

# Calculate relative increase in viewers
bump_by_series <- first_last %>% 
  spread(episode, viewers) %>%   
  mutate(bump = (last - first) / first)
  
# Fill in to make bar chart of bumps by series
ggplot(bump_by_series, aes(x = series, y = bump)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) # converts to %


Chapter 4 - Transform Your Data

Complex recoding with case_when:

  • The case_when function allows for vectorizing multiple if-else-then statements
    • The LHS must give (or be) a boolean
    • The default value for else is NA
  • Example using ages of the baker data
    • bakers %>% mutate(gen = if_else(between(birth_year, 1981, 1996), "millennial", "not millennial")) # simple if statement (boundaries of between are inclusive)
    • bakers %>% mutate(gen = case_when( between(birth_year, 1965, 1980) ~ "gen_x", between(birth_year, 1981, 1996) ~ "millennial" )) # logical ~ result
    • bakers %>% mutate(gen = case_when( between(birth_year, 1928, 1945) ~ "silent", between(birth_year, 1946, 1964) ~ "boomer", between(birth_year, 1965, 1980) ~ "gen_x", between(birth_year, 1981, 1996) ~ "millennial", TRUE ~ "gen_z" ))
    • bakers %>% count(gen, sort = TRUE) %>% mutate(prop = n / sum(n))
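
As a minimal runnable sketch (hypothetical birth years):

```r
library(dplyr)

# Hypothetical birth years
bakers <- tibble::tibble(birth_year = c(1950, 1975, 1990, 2005))

bakers %>%
  mutate(gen = case_when(
    between(birth_year, 1946, 1964) ~ "boomer",     # LHS must be logical
    between(birth_year, 1965, 1980) ~ "gen_x",
    between(birth_year, 1981, 1996) ~ "millennial",
    TRUE ~ "gen_z"                                  # catch-all; without it, unmatched rows get NA
  ))
```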

Factors:

  • The forcats package is made specifically for working with factors - all functions start with fct_
  • Converting to factors helps ensure the proper ordering of the data
    • ggplot(bakers, aes(x = fct_rev(fct_infreq(gen)))) + geom_bar() # reverse by infrequency (build from small to large) # on-the-fly conversions inside ggplot
    • bakers <- bakers %>% mutate(gen = fct_relevel(gen, "silent", "boomer", "gen_x", "millennial", "gen_z")) # conversions of the raw dataset
    • bakers %>% dplyr::pull(gen) %>% levels() # check that this worked
    • ggplot(bakers, aes(x = gen)) + geom_bar() # will now be plotted in the desired order
  • Need to be careful of the proper treatment of factors
    • ggplot(bakers, aes(x = gen, fill = series_winner)) + geom_bar() # FAIL
    • bakers <- bakers %>% mutate(series_winner = as.factor(series_winner))
    • ggplot(bakers, aes(x = gen, fill = series_winner)) + geom_bar() # WORKS
    • ggplot(bakers, aes(x = gen, fill = as.factor(series_winner))) + geom_bar() # ALSO WORKS
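
The reordering functions can be sketched in isolation (hypothetical generation labels):

```r
library(dplyr)
library(forcats)

# Hypothetical generation labels
gen <- factor(c("boomer", "gen_x", "gen_x", "millennial"))

fct_infreq(gen) %>% levels()                 # levels ordered most to least frequent
fct_relevel(gen, "millennial") %>% levels()  # millennial moved to the front
```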

Dates:

  • Can use lubridate for convenience functions such as ymd() or dmy(), with the output being ISO (YYYY-MM-DD)
    • Can also include a vector of suspected dates
    • dmy("17 August 2010") # will work
    • hosts <- tibble::tribble( ~host, ~bday, ~premiere, "Mary", "24 March 1935", "August 17th, 2010", "Paul", "1 March 1966", "August 17th, 2010")
    • hosts <- hosts %>% mutate(bday = dmy(bday), premiere = mdy(premiere))
  • There are three aspects of timespans
    • interval - time span bound by two real dates
    • duration - exact number of seconds in an interval
    • period - change in clock time of an interval
    • hosts <- hosts %>% mutate(age_int = interval(bday, premiere)) # new variable age_int will be of type interval
    • hosts %>% mutate(years_decimal = age_int / years(1), years_whole = age_int %/% years(1)) # years(1) is one year, so this is fractional and whole (floored) years
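
The three timespan types can be sketched directly (hypothetical dates):

```r
library(lubridate)

bday     <- dmy("24 March 1935")
premiere <- mdy("August 17th, 2010")

age_int <- interval(bday, premiere)  # interval: bound by two real dates
as.duration(age_int)                 # duration: exact number of seconds
age_int / years(1)                   # fractional years (period division)
age_int %/% years(1)                 # whole (floored) years -> 75
```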

Strings:

  • The separate function splits one column into 2+ columns (for example "age, job" could become "age" and "job")
    • series5 <- series5 %>% separate(about, into = c("age", "occupation"), sep = ",")
    • series5 <- series5 %>% separate(about, into = c("age", "occupation"), sep = ",") %>% mutate(age = parse_number(age)) # numeric age; parse_number drops the "years" text
  • The stringr package makes working with strings in R easier (typically used within a mutate) - all functions start with str_
    • series5 <- series5 %>% mutate(baker = str_to_upper(baker), showstopper = str_to_lower(showstopper))
    • series5 %>% mutate(pie = str_detect(showstopper, "pie")) # returns a boolean
    • series5 %>% mutate(showstopper = str_replace(showstopper, "pie", "tart")) # find and replace for strings
    • series5 %>% mutate(showstopper = str_remove(showstopper, "pie")) # remove "pie", though there may be trailing whitespace
    • series5 %>% mutate(showstopper = str_remove(showstopper, "pie"), showstopper = str_trim(showstopper)) # trim whitespace at the beginning or end
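
The stringr helpers can be sketched on a bare character vector (hypothetical showstopper entries):

```r
library(stringr)

showstopper <- c("Chocolate Pie", "Lemon Tart")

showstopper <- str_to_lower(showstopper)
str_detect(showstopper, "pie")            # TRUE FALSE
str_replace(showstopper, "pie", "tart")   # "chocolate tart" "lemon tart"
str_trim(str_remove(showstopper, "pie"))  # "chocolate" "lemon tart"
```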

Final thoughts:

  • Using R and the tidyverse for analysis and presentation
  • Reading data using readr and analyzing using dplyr and ggplot2
  • Taming variable types, names, and values
  • Transforming data using stringr and lubridate
  • The “here” package can make working with file paths much easier

Example code includes:

baker_results <- readr::read_csv("./RInputFiles/baker_results.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   baker_full = col_character(),
##   baker = col_character(),
##   occupation = col_character(),
##   hometown = col_character(),
##   baker_last = col_character(),
##   baker_first = col_character(),
##   first_date_appeared = col_date(format = ""),
##   last_date_appeared = col_date(format = ""),
##   first_date_us = col_date(format = ""),
##   last_date_us = col_date(format = "")
## )
## See spec(...) for full column specifications.
messy_baker_results <- readr::read_csv("./RInputFiles/messy_baker_results.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   series = col_double(),
##   star_baker = col_double(),
##   technical_winner = col_double(),
##   technical_top3 = col_double(),
##   technical_bottom = col_double(),
##   technical_highest = col_double(),
##   technical_lowest = col_double(),
##   technical_median = col_double(),
##   series_winner = col_double(),
##   series_runner_up = col_double(),
##   total_episodes_appeared = col_double(),
##   percent_episodes_appeared = col_double(),
##   percent_technical_top3 = col_double(),
##   first_date_appeared_uk = col_date(format = ""),
##   last_date_appeared_uk = col_date(format = ""),
##   first_date_us = col_date(format = ""),
##   last_date_us = col_date(format = ""),
##   e_1_technical = col_double(),
##   e_10_technical = col_double(),
##   e_2_technical = col_double()
##   # ... with 7 more columns
## )
## See spec(...) for full column specifications.
bakers <- baker_results
glimpse(bakers)
## Observations: 95
## Variables: 24
## $ series                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
## $ baker_full                <chr> "Annetha Mills", "David Chambers", "...
## $ baker                     <chr> "Annetha", "David", "Edd", "Jasminde...
## $ age                       <dbl> 30, 31, 24, 45, 25, 51, 44, 48, 37, ...
## $ occupation                <chr> "Single mother", "Entrepreneur", "De...
## $ hometown                  <chr> "Essex", "Milton Keynes", "Bradford"...
## $ baker_last                <chr> "Mills", "Chambers", "Kimber", "Rand...
## $ baker_first               <chr> "Annetha", "David", "Edward", "Jasmi...
## $ star_baker                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ...
## $ technical_winner          <dbl> 0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 1, 2, ...
## $ technical_top3            <dbl> 1, 1, 4, 2, 1, 0, 0, 0, 4, 2, 3, 5, ...
## $ technical_bottom          <dbl> 1, 3, 1, 2, 2, 1, 1, 0, 1, 2, 1, 3, ...
## $ technical_highest         <dbl> 2, 3, 1, 2, 1, 10, 4, NA, 1, 2, 1, 1...
## $ technical_lowest          <dbl> 7, 8, 6, 5, 9, 10, 4, NA, 8, 5, 5, 6...
## $ technical_median          <dbl> 4.5, 4.5, 2.0, 3.0, 6.0, 10.0, 4.0, ...
## $ series_winner             <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ series_runner_up          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, ...
## $ total_episodes_appeared   <dbl> 2, 4, 6, 5, 3, 1, 2, 1, 6, 6, 4, 8, ...
## $ first_date_appeared       <date> 2010-08-17, 2010-08-17, 2010-08-17,...
## $ last_date_appeared        <date> 2010-08-24, 2010-09-07, 2010-09-21,...
## $ first_date_us             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ last_date_us              <date> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ percent_episodes_appeared <dbl> 33.33333, 66.66667, 100.00000, 83.33...
## $ percent_technical_top3    <dbl> 50.00000, 25.00000, 66.66667, 40.000...
# Create skill variable with 3 levels
bakers <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    TRUE ~ "well_rounded"
  ))
  
# Filter zeroes to examine skill variable
bakers %>% 
  filter(star_baker==0 & technical_winner==0) %>% 
  count(skill)
## # A tibble: 1 x 2
##   skill            n
##   <chr>        <int>
## 1 well_rounded    41
# Add pipe to drop skill = NA
bakers_skill <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    star_baker == 0 & technical_winner == 0 ~ NA_character_,
    star_baker == technical_winner  ~ "well_rounded"
  )) %>% 
  drop_na(skill)
  
# Count bakers by skill
bakers_skill %>%
  count(skill)
## # A tibble: 3 x 2
##   skill            n
##   <chr>        <int>
## 1 high_tech       24
## 2 super_star      15
## 3 well_rounded    15
# Cast skill as a factor
bakers <- bakers %>% 
  mutate(skill = as.factor(skill))

# Examine levels
bakers %>%
  pull(skill) %>%
  levels()
## [1] "high_tech"    "super_star"   "well_rounded"
baker_dates <- bakers %>%
    select(series, baker, contains("date")) %>%
    mutate(last_date_appeared_us=as.character(last_date_us), 
           first_date_appeared_us=as.character(first_date_us)
           ) %>%
    rename(first_date_appeared_uk=first_date_appeared, last_date_appeared_uk=last_date_appeared) %>%
    select(-last_date_us, -first_date_us)
glimpse(baker_dates)
## Observations: 95
## Variables: 6
## $ series                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...
## $ baker                  <chr> "Annetha", "David", "Edd", "Jasminder",...
## $ first_date_appeared_uk <date> 2010-08-17, 2010-08-17, 2010-08-17, 20...
## $ last_date_appeared_uk  <date> 2010-08-24, 2010-09-07, 2010-09-21, 20...
## $ last_date_appeared_us  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ first_date_appeared_us <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
# Add a line to extract labeled month
baker_dates <- baker_dates %>% 
  mutate(last_date_appeared_us=lubridate::ymd(last_date_appeared_us), 
         last_month_us=lubridate::month(last_date_appeared_us, label=TRUE)
         )
         
ggplot(baker_dates, aes(x=last_month_us)) + geom_bar()

baker_time <- baker_dates %>%
    mutate(first_date_appeared_us=lubridate::ymd(first_date_appeared_us)) %>%
    select(-last_month_us)
glimpse(baker_time)
## Observations: 95
## Variables: 6
## $ series                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...
## $ baker                  <chr> "Annetha", "David", "Edd", "Jasminder",...
## $ first_date_appeared_uk <date> 2010-08-17, 2010-08-17, 2010-08-17, 20...
## $ last_date_appeared_uk  <date> 2010-08-24, 2010-09-07, 2010-09-21, 20...
## $ last_date_appeared_us  <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ first_date_appeared_us <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# Add a line to create whole months on air variable
baker_time <- baker_time  %>% 
  mutate(time_on_air = lubridate::interval(first_date_appeared_uk, last_date_appeared_uk),
         weeks_on_air = time_on_air / lubridate::weeks(1), 
         months_on_air = time_on_air %/% months(1)
         )

# Count rows
messy_baker_results %>% 
  count(position_reached)
## # A tibble: 8 x 2
##   position_reached     n
##   <chr>            <int>
## 1 <NA>                71
## 2 Runner-Up            1
## 3 Runner up            2
## 4 Runner Up           12
## 5 Third Place          1
## 6 winner               2
## 7 Winner               1
## 8 WINNER               5
# Add another mutate to replace "THIRD PLACE" with "RUNNER UP" and count
messy_baker_results <- messy_baker_results %>% 
  mutate(position_reached = str_to_upper(position_reached),
         position_reached = str_replace(position_reached, "-", " "), 
         position_reached = str_replace(position_reached, "THIRD PLACE", "RUNNER UP"))

# Count rows
messy_baker_results %>% 
  count(position_reached)
## # A tibble: 3 x 2
##   position_reached     n
##   <chr>            <int>
## 1 <NA>                71
## 2 RUNNER UP           16
## 3 WINNER               8
# Add a line to create new variable called student
bakers <- bakers %>% 
    mutate(occupation = str_to_lower(occupation), 
           student=str_detect(occupation, "student")
           )

# Find all students and examine occupations
bakers %>% 
  filter(student) %>%
  select(baker, occupation, student)
## # A tibble: 8 x 3
##   baker   occupation                            student
##   <chr>   <chr>                                 <lgl>  
## 1 Jason   civil engineering student             TRUE   
## 2 James   medical student                       TRUE   
## 3 John    law student                           TRUE   
## 4 Ruby    history of art and philosophy student TRUE   
## 5 Martha  student                               TRUE   
## 6 Michael student                               TRUE   
## 7 Rav     student support                       TRUE   
## 8 Liam    student                               TRUE

Modeling Data in the Tidyverse

Chapter 1 - Introduction to Modeling

Background on modeling for explanation:

  • Generally, the model has y as a function of x plus epsilon, where y is the outcome of interest and x is a set of explanatory variables and epsilon is irreducible error
    • The x can be either explanatory or predictive - depends on the purpose of the analysis
  • Example of explanation - can differences in teacher evaluation scores be explained by teacher attributes
    • library(dplyr)
    • library(moderndive)
    • glimpse(evals) # evals data is available in the moderndive package (from ModernDive.com)
    • ggplot(evals, aes(x = score)) + geom_histogram(binwidth = 0.25) + labs(x = "teaching score", y = "count") # EDA on scores using histogram
    • evals %>% summarize(mean_score = mean(score), median_score = median(score), sd_score = sd(score)) # summary statistics using dplyr::summarize

Background on modeling for prediction:

  • House sales in King County USA in 2014-2015 (from Kaggle) based on features such as size, bedrooms, etc.
    • glimpse(house_prices)
    • ggplot(house_prices, aes(x = price)) + geom_histogram() + labs(x = "house price", y = "count")
    • house_prices <- house_prices %>% mutate(log10_price = log10(price))
    • house_prices %>% select(price, log10_price)
    • ggplot(house_prices, aes(x = log10_price)) + geom_histogram() + labs(x = "log10 house price", y = "count") # after transformation

Modeling problem for explanation:

  • Typically, both the function that relates x and y and the process that generates the errors are unknown
    • Goal is to create a model that can generate y-hat by separating signal from noise
  • Can start by considering linear models, assessed as a starting point by examining a scatter plot
    • ggplot(evals, aes(x = age, y = score)) + geom_point() + labs(x = "age", y = "score", title = "Teaching score over age")
    • ggplot(evals, aes(x = age, y = score)) + geom_jitter() + labs(x = "age", y = "score", title = "Teaching score over age (jittered)")
  • Can further explore the data by looking at correlations among some or all of the potential explanatory variables
    • evals %>% summarize(correlation = cor(score, age))

Modeling problem for prediction:

  • For explanation, we care about the form of the function
  • For prediction, we care mainly that the function makes good predictions (even if it may not be easy to explain)
    • house_prices %>% select(log10_price, condition) %>% glimpse() # condition is a categorical variable saved as a factor
    • ggplot(house_prices, aes(x = condition, y = log10_price)) + geom_boxplot() + labs(x = "house condition", y = "log10 price", title = "log10 house price over condition")
  • Group means are central to the linear modeling process - a regression on a categorical variable essentially fits the group means
    • house_prices %>% group_by(condition) %>% summarize(mean = mean(log10_price), sd = sd(log10_price), n = n())

Example code includes:

data(evals, package="moderndive")
glimpse(evals)
## Observations: 463
## Variables: 13
## $ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ score        <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5,...
## $ age          <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 4...
## $ bty_avg      <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, ...
## $ gender       <fct> female, female, female, female, male, male, male,...
## $ ethnicity    <fct> minority, minority, minority, minority, not minor...
## $ language     <fct> english, english, english, english, english, engl...
## $ rank         <fct> tenure track, tenure track, tenure track, tenure ...
## $ pic_outfit   <fct> not formal, not formal, not formal, not formal, n...
## $ pic_color    <fct> color, color, color, color, color, color, color, ...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, ...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 2...
## $ cls_level    <fct> upper, upper, upper, upper, upper, upper, upper, ...
# Plot the histogram
ggplot(evals, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")

# Compute summary stats
evals %>%
  summarize(mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age))
## # A tibble: 1 x 3
##   mean_age median_age sd_age
##      <dbl>      <int>  <dbl>
## 1     48.4         48   9.80
data(house_prices, package="moderndive")
glimpse(house_prices)
## Observations: 21,613
## Variables: 21
## $ id            <chr> "7129300520", "6414100192", "5631500400", "24872...
## $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09,...
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000,...
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, ...
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0...
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ zipcode       <fct> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x="Size (sq.feet)", y="count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Add log10_sqft_living
house_prices_2 <- house_prices %>%
  mutate(log10_sqft_living = log10(sqft_living))

# Plot the histogram  
ggplot(house_prices_2, aes(x = log10_sqft_living)) +
  geom_histogram() +
  labs(x = "log10 size", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Plot the histogram
ggplot(evals, aes(x=bty_avg)) +
  geom_histogram(binwidth=0.5) +
  labs(x = "Beauty score", y = "count")

# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "teaching score")

# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  labs(x = "beauty score", y = "teaching score")

# Compute correlation
evals %>%
  summarize(correlation = cor(score, bty_avg))
## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1       0.187
house_prices <- house_prices %>%
    mutate(log10_price=log10(price))

# View the structure of log10_price and waterfront
house_prices %>%
  select(log10_price, waterfront) %>%
  glimpse()
## Observations: 21,613
## Variables: 2
## $ log10_price <dbl> 5.346157, 5.730782, 5.255273, 5.781037, 5.707570, ...
## $ waterfront  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
# Plot 
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
  geom_boxplot() +
  labs(x = "waterfront", y = "log10 price")

# Calculate stats
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n())
## # A tibble: 2 x 3
##   waterfront mean_log10_price     n
##   <lgl>                 <dbl> <int>
## 1 FALSE                  5.66 21450
## 2 TRUE                   6.12   163
# Prediction of price for houses with waterfront
10^(6.12)
## [1] 1318257
# Prediction of price for houses without waterfront
10^(5.66)
## [1] 457088.2

Chapter 2 - Modeling with Regression

Explaining teaching score with age:

  • Can overlay a regression line to the scatter plot for a bivariate relationship
    • ggplot(evals, aes(x = age, y = score)) + geom_point() + labs(x = "age", y = "score", title = "Teaching score over age") + geom_smooth(method = "lm", se = FALSE)
  • In simple linear regression, the assumption is that f(x) is B0 + B1*x
    • The fitted model f-hat does not have an error term, since it is just the model prediction for a given value of x
    • model_score_1 <- lm(score ~ age, data = evals)
    • moderndive::get_regression_table(model_score_1)
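The lm() workflow above can be sketched with base R's built-in mtcars data (a stand-in for the course's evals data, which requires the moderndive package):

```r
# Fit a simple linear regression: mpg as a function of car weight
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted model f-hat is just b0 + b1 * x, with no error term
coef(fit)

# Manual prediction for a car with wt = 3 (i.e., 3,000 lbs)
unname(coef(fit)[1] + coef(fit)[2] * 3)  # approximately 21.25
```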

Predicting teaching score using age:

  • Can make predictions based on the existing regression line - f-hat can be used for both explanatory and predictive purposes
  • The residuals are the errors (predictive vs. actual values), and correspond to the epsilon of the general modeling framework
    • On average, for linear regression, the residuals should average out to zero
    • get_regression_points(model_score_1) # gives y, x, y-hat, and residuals
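The claim that residuals average out to zero can be checked directly (a base-R sketch on the built-in mtcars data, standing in for evals):

```r
# Any least-squares fit that includes an intercept has residuals
# that sum - and hence average - to zero (up to floating-point error)
fit <- lm(mpg ~ wt, data = mtcars)
mean(residuals(fit))  # ~0
```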

Explaining teaching score with gender:

  • Can extend the models to include categorical data, such as gender
    • ggplot(evals, aes(x = score)) + geom_histogram(binwidth = 0.25) + facet_wrap(~gender) + labs(x = "score", y = "count")
    • model_score_3 <- lm(score ~ gender, data = evals) # will just give an overall mean and a change in mean vs. the first-level factor
  • Can also look at multi-level factors, such as rank (teacher type)
    • evals %>% group_by(rank) %>% summarize(n = n())

Predicting teaching score with gender:

  • Can use group means as part of the predictive approach - if only factors are used in the regression, everyone in the same category (or combination of categories) receives the same prediction
    • model_score_3_points <- get_regression_points(model_score_3)

Example code includes:

# Plot 
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = "lm", se = FALSE)

# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Output content
model_score_2
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Coefficients:
## (Intercept)      bty_avg  
##     3.88034      0.06664
# Output regression table
moderndive::get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 0.067 * 5
y_hat
## [1] 4.215
# Compute residual y - y_hat
4.7 - y_hat
## [1] 0.485
# Get regression table
moderndive::get_regression_table(model_score_2, digits = 5)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept   3.88      0.0761     51.0  0         3.73     4.03  
## 2 bty_avg     0.0666    0.0163      4.09 0.00005   0.0346   0.0986
# Get all fitted/predicted values and residuals
moderndive::get_regression_points(model_score_2)
## # A tibble: 463 x 5
##       ID score bty_avg score_hat residual
##    <int> <dbl>   <dbl>     <dbl>    <dbl>
##  1     1   4.7    5         4.21    0.486
##  2     2   4.1    5         4.21   -0.114
##  3     3   3.9    5         4.21   -0.314
##  4     4   4.8    5         4.21    0.586
##  5     5   4.6    3         4.08    0.52 
##  6     6   4.3    3         4.08    0.22 
##  7     7   2.8    3         4.08   -1.28 
##  8     8   4.1    3.33      4.10   -0.002
##  9     9   3.4    3.33      4.10   -0.702
## 10    10   4.5    3.17      4.09    0.409
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
moderndive::get_regression_points(model_score_2) %>% 
  mutate(score_hat_2 = 3.88 + 0.0666 * bty_avg)
## # A tibble: 463 x 6
##       ID score bty_avg score_hat residual score_hat_2
##    <int> <dbl>   <dbl>     <dbl>    <dbl>       <dbl>
##  1     1   4.7    5         4.21    0.486        4.21
##  2     2   4.1    5         4.21   -0.114        4.21
##  3     3   3.9    5         4.21   -0.314        4.21
##  4     4   4.8    5         4.21    0.586        4.21
##  5     5   4.6    3         4.08    0.52         4.08
##  6     6   4.3    3         4.08    0.22         4.08
##  7     7   2.8    3         4.08   -1.28         4.08
##  8     8   4.1    3.33      4.10   -0.002        4.10
##  9     9   3.4    3.33      4.10   -0.702        4.10
## 10    10   4.5    3.17      4.09    0.409        4.09
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
moderndive::get_regression_points(model_score_2) %>% 
  mutate(residual_2 = score - score_hat)
## # A tibble: 463 x 6
##       ID score bty_avg score_hat residual residual_2
##    <int> <dbl>   <dbl>     <dbl>    <dbl>      <dbl>
##  1     1   4.7    5         4.21    0.486      0.486
##  2     2   4.1    5         4.21   -0.114     -0.114
##  3     3   3.9    5         4.21   -0.314     -0.314
##  4     4   4.8    5         4.21    0.586      0.586
##  5     5   4.6    3         4.08    0.52       0.520
##  6     6   4.3    3         4.08    0.22       0.220
##  7     7   2.8    3         4.08   -1.28      -1.28 
##  8     8   4.1    3.33      4.10   -0.002     -0.002
##  9     9   3.4    3.33      4.10   -0.702     -0.702
## 10    10   4.5    3.17      4.09    0.409      0.409
## # ... with 453 more rows
ggplot(evals, aes(x=rank, y=score)) +
  geom_boxplot() +
  labs(x = "rank", y = "score")

evals %>%
  group_by(rank) %>%
  summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
## # A tibble: 3 x 4
##   rank             n mean_score sd_score
##   <fct>        <int>      <dbl>    <dbl>
## 1 teaching       102       4.28    0.498
## 2 tenure track   108       4.15    0.561
## 3 tenured        253       4.14    0.550
# Fit regression model
model_score_4 <- lm(score ~ rank, data = evals)

# Get regression table
moderndive::get_regression_table(model_score_4, digits = 5)
## # A tibble: 3 x 7
##   term             estimate std_error statistic p_value lower_ci upper_ci
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept           4.28     0.0536     79.9   0         4.18    4.39  
## 2 ranktenure track   -0.130    0.0748     -1.73  0.0837   -0.277   0.0173
## 3 ranktenured        -0.145    0.0636     -2.28  0.0228   -0.270  -0.0203
# teaching mean
teaching_mean <- 4.28

# tenure track mean
tenure_track_mean <- 4.28-0.13 

# tenure mean
tenure_mean <- 4.28-0.145


# Calculate predictions and residuals
model_score_4_points <- moderndive::get_regression_points(model_score_4)
model_score_4_points
## # A tibble: 463 x 5
##       ID score rank         score_hat residual
##    <int> <dbl> <fct>            <dbl>    <dbl>
##  1     1   4.7 tenure track      4.16    0.545
##  2     2   4.1 tenure track      4.16   -0.055
##  3     3   3.9 tenure track      4.16   -0.255
##  4     4   4.8 tenure track      4.16    0.645
##  5     5   4.6 tenured           4.14    0.461
##  6     6   4.3 tenured           4.14    0.161
##  7     7   2.8 tenured           4.14   -1.34 
##  8     8   4.1 tenured           4.14   -0.039
##  9     9   3.4 tenured           4.14   -0.739
## 10    10   4.5 tenured           4.14    0.361
## # ... with 453 more rows
# Plot residuals
ggplot(model_score_4_points, aes(x=residual)) +
  geom_histogram() +
  labs(x = "residuals", title = "Residuals from score ~ rank model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Chapter 3 - Modeling with Multiple Regression

Explaining house price with year and size:

  • Can incorporate 2+ explanatory / predictive variables using multiple regression
    • house_prices %>% select(price, sqft_living, condition, waterfront) %>% glimpse()
    • The log-10 transformation is helpful for this specific dataset (assume the code below for future examples in this course)
    • house_prices <- house_prices %>% mutate( log10_price = log10(price), log10_sqft_living = log10(sqft_living) )
  • Exploring the relationship between multiple variables - EDA and regression
    • Can create a 3D plot with associated regression plane using plotly
    • model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built, data = house_prices)
    • get_regression_table(model_price_1, digits = 5)

Predicting house price using year and size:

  • Can get the fitted values and exponentiate as needed, assessing the overall fit or lack thereof (sum-squared residuals) of the model
    • get_regression_points(model_price_1, digits = 5)
    • get_regression_points(model_price_1) %>% mutate(sq_residuals = residual^2) %>% summarize(sum_sq_residuals = sum(sq_residuals)) # SSR

Explaining house price with size and condition:

  • The EDA from previous chapters is repeated
    • house_prices <- house_prices %>% mutate( log10_price = log10(price), log10_sqft_living = log10(sqft_living) )
    • house_prices %>% group_by(condition) %>% summarize(mean = mean(log10_price), sd = sd(log10_price), n = n())
  • The parallel slopes model fits lines with a common slope but different intercepts (the intercept offsets are the coefficients of the categorical variable)
    • model_price_3 <- lm(log10_price ~ log10_sqft_living + condition, data = house_prices)
    • get_regression_table(model_price_3, digits = 5)
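The parallel slopes idea can be sketched in base R with the built-in mtcars data (cyl playing the role of condition): the categorical variable shifts the intercept while the numeric slope is shared.

```r
# One shared slope for wt, separate intercepts for each cyl level
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit)
# "(Intercept)" gives the line for cyl == 4; "factor(cyl)6" and "factor(cyl)8"
# are intercept offsets; all three lines share the single "wt" slope
```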

Predicting house price using size and condition:

  • Objective is to predict on new data (as opposed to checking our predictions on data where we already had the answer)
    • model_price_3 <- lm(log10_price ~ log10_sqft_living + condition, data = house_prices)
    • get_regression_table(model_price_3)
  • Automating the housing price prediction process
    • new_houses <- read_csv("new_houses.csv")
    • new_houses
    • get_regression_points(model_price_3, newdata = new_houses) # moderndive's version of the predict() function
    • get_regression_points(model_price_3, newdata = new_houses) %>% mutate(price_hat = 10^log10_price_hat)

Example code includes:

# Create scatterplot with regression line
ggplot(house_prices, aes(x=bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)

# Remove outlier
house_prices_transform <- house_prices %>%
    filter(bedrooms < 33) %>%
    mutate(log10_sqft_living=log10(sqft_living))

# Create scatterplot with regression line
ggplot(house_prices_transform, aes(x = bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)

# Fit model
model_price_2 <- lm(log10_price ~ log10_sqft_living + bedrooms, data = house_prices_transform)

# Get regression table
moderndive::get_regression_table(model_price_2)
## # A tibble: 3 x 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            2.69      0.023     116.        0    2.65     2.74 
## 2 log10_sqft_living    0.941     0.008     118.        0    0.925    0.957
## 3 bedrooms            -0.033     0.002     -20.5       0   -0.036   -0.03
# Make prediction in log10 dollars
2.69 + 0.941 * log10(1000) - 0.033 * 3
## [1] 5.414
# Make prediction dollars
10**(2.69 + 0.941 * log10(1000) - 0.033 * 3)
## [1] 259417.9
# Automate prediction and residual computation
moderndive::get_regression_points(model_price_2) %>%
    mutate(squared_residuals = residual**2) %>%
    summarize(sum_squared_residuals = sum(squared_residuals))
## # A tibble: 1 x 1
##   sum_squared_residuals
##                   <dbl>
## 1                  604.
# Fit model
model_price_4 <- lm(log10_price ~ log10_sqft_living + waterfront, data = house_prices_transform)

# Get regression table
moderndive::get_regression_table(model_price_4)
## # A tibble: 3 x 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            2.96      0.02      146.        0    2.92     3.00 
## 2 log10_sqft_living    0.825     0.006     134.        0    0.813    0.837
## 3 waterfrontTRUE       0.322     0.013      24.5       0    0.296    0.348
# Prediction for House A
10**(2.96 + 0.825*2.9 + 0.322)
## [1] 472606.8
# Prediction for House B
10**(2.96 + 0.825*3.1 + 0)
## [1] 329230.5
# View the "new" houses
new_houses_2 <- tibble(log10_sqft_living=c(2.9, 3.1), waterfront=c(TRUE, FALSE))
new_houses_2
## # A tibble: 2 x 2
##   log10_sqft_living waterfront
##               <dbl> <lgl>     
## 1               2.9 TRUE      
## 2               3.1 FALSE
# Get predictions price_hat in dollars on "new" houses
moderndive::get_regression_points(model_price_4, newdata = new_houses_2) %>% 
  mutate(price_hat = 10**log10_price_hat)
## # A tibble: 2 x 5
##      ID log10_sqft_living waterfront log10_price_hat price_hat
##   <int>             <dbl> <lgl>                <dbl>     <dbl>
## 1     1               2.9 TRUE                  5.67   472063.
## 2     2               3.1 FALSE                 5.52   328095.

Chapter 4 - Model Selection and Assessment

Model selection and assessment:

  • Can use multiple models for the same data and compare
    • model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built, data = house_prices)
    • model_price_3 <- lm(log10_price ~ log10_sqft_living + condition, data = house_prices)
    • get_regression_points(model_price_1) %>% mutate(sq_residuals = residual^2) %>% summarize(sum_sq_residuals = sum(sq_residuals))
    • get_regression_points(model_price_3) %>% mutate(sq_residuals = residual^2) %>% summarize(sum_sq_residuals = sum(sq_residuals))

Assessing model fit with R-squared:

  • The R-squared is a reasonable measure of model fit - R-squared = 1 - Var(Residuals) / Var(Y)
    • Larger R-squared is suggestive of better fit, with values (typically) constrained between 0 and 1
    • R-squared is the proportion of variation in the outcome variable that can be explained by the model
    • model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built, data = house_prices)
    • get_regression_points(model_price_1) %>% summarize(r_squared = 1 - var(residual) / var(log10_price))
    • model_price_3 <- lm(log10_price ~ log10_sqft_living + condition, data = house_prices)
    • get_regression_points(model_price_3) %>% summarize(r_squared = 1 - var(residual) / var(log10_price))
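The 1 - Var(residuals) / Var(y) definition can be checked against the R-squared that summary.lm() reports (a base-R sketch on mtcars; the course computes the same quantity from get_regression_points() output):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# R-squared as one minus the unexplained share of variance
r_sq_manual <- 1 - var(residuals(fit)) / var(mtcars$mpg)

all.equal(r_sq_manual, summary(fit)$r.squared)  # TRUE
```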

Assessing predictions with RMSE:

  • RMSE (Root Mean Squared Error) is a slight variation on RSS
    • Where RSS is the sum of the squared residuals, RMSE is the square root of the mean of the squared residuals
    • model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built, data = house_prices)
    • get_regression_points(model_price_1) %>% mutate(sq_residuals = residual^2) %>% summarize(sum_sq_residuals = sum(sq_residuals))
    • get_regression_points(model_price_1) %>% mutate(sq_residuals = residual^2) %>% summarize(mse = mean(sq_residuals))
    • get_regression_points(model_price_1) %>% mutate(sq_residuals = residual^2) %>% summarize(mse = mean(sq_residuals)) %>% mutate(rmse = sqrt(mse))
  • Cannot calculate RMSE on genuinely new data - making predictions implies that we do not know the actual values
    • get_regression_points(model_price_3, newdata = new_houses) %>% mutate(sq_residuals = residual^2) %>% summarize(mse = mean(sq_residuals)) %>% mutate(rmse = sqrt(mse))
    • The above code will error, since the residuals cannot be computed without observed outcomes
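The RSS / MSE / RMSE relationships can be computed directly from a model's residuals (a base-R sketch on mtcars):

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

rss  <- sum(res^2)    # residual sum of squares
mse  <- mean(res^2)   # mean squared error = RSS / n
rmse <- sqrt(mse)     # RMSE, back in the units of the outcome
c(rss = rss, mse = mse, rmse = rmse)
```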

Validation set prediction framework:

  • Use two different datasets for modeling; a training set used for modeling, and a test set used for assessing likely out-of-sample errors
    • house_prices_shuffled <- house_prices %>% sample_frac(size = 1, replace = FALSE) # Randomly shuffle order of rows
    • train <- house_prices_shuffled %>% slice(1:10000)
    • test <- house_prices_shuffled %>% slice(10001:21613)
    • train_model_price_1 <- lm(log10_price ~ log10_sqft_living + yr_built, data = train)
  • After having trained the model on the train data, can assess the fit using the test data
    • get_regression_points(train_model_price_1, newdata = test)
    • get_regression_points(train_model_price_1, newdata = test) %>% mutate(sq_residuals = residual^2) %>% summarize(rmse = sqrt(mean(sq_residuals)))

Next steps:

  • The tidyverse ties together many of the packages that help with data wrangling and analysis
  • Can extend regressions to areas like polynomials and trees
  • “ModernDive” is a textbook on the tidyverse tools

Example code includes:

# Model 2
model_price_2 <- lm(log10_price ~ log10_sqft_living + bedrooms, data = house_prices_transform)

# Calculate squared residuals
moderndive::get_regression_points(model_price_2) %>% 
    mutate(sq_residuals=residual**2) %>% 
    summarize(sum_sq_residuals=sum(sq_residuals))
## # A tibble: 1 x 1
##   sum_sq_residuals
##              <dbl>
## 1             604.
# Model 4
model_price_4 <- lm(log10_price ~ log10_sqft_living + waterfront, data = house_prices_transform)

# Calculate squared residuals
moderndive::get_regression_points(model_price_4) %>% 
    mutate(sq_residuals = residual**2) %>% 
    summarize(sum_sq_residuals=sum(sq_residuals))
## # A tibble: 1 x 1
##   sum_sq_residuals
##              <dbl>
## 1             599.
# Get fitted/values & residuals, compute R^2 using residuals
moderndive::get_regression_points(model_price_2) %>% 
    summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
##   r_squared
##       <dbl>
## 1     0.466
# Get fitted/values & residuals, compute R^2 using residuals
moderndive::get_regression_points(model_price_4) %>% 
    summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
##   r_squared
##       <dbl>
## 1     0.470
# Get all residuals, square them, take the mean and square root
moderndive::get_regression_points(model_price_2) %>% 
    mutate(sq_residuals = residual^2) %>% 
    summarize(mse = mean(sq_residuals)) %>% 
    mutate(rmse = sqrt(mse))
## # A tibble: 1 x 2
##      mse  rmse
##    <dbl> <dbl>
## 1 0.0279 0.167
# MSE and RMSE for model_price_2
moderndive::get_regression_points(model_price_2) %>% 
    mutate(sq_residuals = residual^2) %>% 
    summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
##      mse  rmse
##    <dbl> <dbl>
## 1 0.0279 0.167
# MSE and RMSE for model_price_4
moderndive::get_regression_points(model_price_4) %>% 
    mutate(sq_residuals = residual^2) %>% 
    summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
##      mse  rmse
##    <dbl> <dbl>
## 1 0.0277 0.166
# Set random number generator seed value for reproducibility
set.seed(76)

# Randomly reorder the rows
house_prices_shuffled <- house_prices_transform %>% 
    sample_frac(size = 1, replace = FALSE)

# Train/test split
train <- house_prices_shuffled %>% 
    slice(1:10000)
test <- house_prices_shuffled %>% 
    slice(10001:nrow(.))

# Fit model to training set
train_model_2 <- lm(log10_price ~ log10_sqft_living + bedrooms, data=train)


# Compute RMSE (train)
moderndive::get_regression_points(train_model_2) %>% 
    mutate(sq_residuals = residual**2) %>% 
    summarize(rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 1
##    rmse
##   <dbl>
## 1 0.167
# Compute RMSE (test)
moderndive::get_regression_points(train_model_2, newdata = test) %>% 
    mutate(sq_residuals = residual**2) %>% 
    summarize(rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 1
##    rmse
##   <dbl>
## 1 0.167

Analyzing Survey Data in R

Chapter 1 - Introduction to Survey Data

What are survey weights?

  • Survey weights sometimes appear inside a dataset, to reflect potential over/under sampling
    • Survey weights result from a complex survey design - each weight is the number of people in the population represented by that entry in the sampling frame
    • For example, average income would be the sum-product of weights and incomes divided by the sum of weights
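The weighted-average formula can be written out in base R (hypothetical incomes and weights for illustration; base R's weighted.mean() implements the same sum-product):

```r
income <- c(30000, 55000, 80000)  # hypothetical survey responses
wts    <- c(120, 80, 40)          # people represented by each respondent

# Weighted average income = sum(w * x) / sum(w)
manual <- sum(wts * income) / sum(wts)
manual                                         # about 46667
all.equal(manual, weighted.mean(income, wts))  # TRUE
```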

Specifying elements of the design in R:

  • Simple random sampling is when every member of the population is known and has an equal chance of being selected
    • library(survey)
    • srs_design <- svydesign(data = paSample, weights = ~wts, fpc=~N, id=~1) # the ~ means that these are column names
  • Stratified sampling is when a simple random sample is taken from each of the strata (sub-units)
    • For example, taking 100 people from every county in a state, so that county-level averages can be gathered
    • stratified_design <- svydesign(data = paSample, id = ~1, weights = ~wts, strata = ~county, fpc = ~N)
  • Cluster sampling is when the population is grouped into clusters, with a simple random sample of clusters selected, and with simple random samples taken within each selected cluster
    • cluster_design <- svydesign(data = paSample, id = ~county + personid, fpc = ~N1 + N2, weights = ~wts)
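The two-stage logic of cluster sampling (randomly select clusters, then randomly sample within the selected clusters) can be sketched in base R without the survey package; the population below is made up for illustration:

```r
set.seed(1)
# Hypothetical population: 1,000 people spread over 20 counties
pop <- data.frame(personid = 1:1000,
                  county = sample(paste0("county_", 1:20), 1000, replace = TRUE))

# Stage 1: simple random sample of 5 counties (the clusters)
sampled_counties <- sample(unique(pop$county), 5)

# Stage 2: simple random sample of up to 10 people within each selected county
cluster_sample <- do.call(rbind, lapply(sampled_counties, function(cty) {
  members <- pop[pop$county == cty, ]
  members[sample(nrow(members), min(10, nrow(members))), ]
}))
```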

Visualizing impact of survey weights:

  • NHANES data - assessment of health of persons in the US, derived by a health check in a mobile doctor’s office
    • Stage 0 - stratified by geography and proportion minority
    • Stage 1 - within strata, counties randomly selected (selection likelihood proportional to population)
    • Stage 2 - within counties, city blocks randomly selected (selection likelihood proportional to population)
    • Stage 3 - within city blocks, households randomly selected (based on demographics)
    • Stage 4 - within households, people randomly selected
  • NHANES data are available through a package in R
    • library(NHANES)
    • dim(NHANESraw)
    • summarize(NHANESraw, N_hat = sum(WTMEC2YR)) # sums to double the US population, due to having 4 years of data when desiring only 2 years of data
    • NHANESraw <- mutate(NHANESraw, WTMEC4YR = WTMEC2YR/2) # fix the double population issue
    • NHANES_design <- svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, nest = TRUE, weights = ~WTMEC4YR) # id is the cluster (first-level); nest = TRUE because cluster ids repeat across strata, so they must be treated as nested within strata
    • distinct(NHANESraw, SDMVPSU) # only takes 3 values, since only 1-3 counties are selected

Example code includes:

colTypes <- "FINLWT21 numeric _ FINCBTAX integer _ BLS_URBN integer _ POPSIZE integer _ EDUC_REF character _ EDUCA2 character _ AGE_REF integer _ AGE2 character _ SEX_REF integer _ SEX2 integer _ REF_RACE integer _ RACE2 integer _ HISP_REF integer _ HISP2 integer _ FAM_TYPE integer _ MARITAL1 integer _ REGION integer _ SMSASTAT integer _ HIGH_EDU character _ EHOUSNGC numeric _ TOTEXPCQ numeric _ FOODCQ numeric _ TRANSCQ numeric _ HEALTHCQ numeric _ ENTERTCQ numeric _ EDUCACQ integer _ TOBACCCQ numeric _ STUDFINX character _ IRAX character _ CUTENURE integer _ FAM_SIZE integer _ VEHQ integer _ ROOMSQ character _ INC_HRS1 character _ INC_HRS2 character _ EARNCOMP integer _ NO_EARNR integer _ OCCUCOD1 character _ OCCUCOD2 character _ STATE character _ DIVISION integer _ TOTXEST integer _ CREDFINX character _ CREDITB integer _ CREDITX character _ BUILDING character _ ST_HOUS integer _ INT_PHON character _ INT_HOME character _ "

ce <- readr::read_csv("./RInputFiles/ce.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EDUC_REF = col_character(),
##   EDUCA2 = col_character(),
##   AGE2 = col_character(),
##   HIGH_EDU = col_character(),
##   STUDFINX = col_character(),
##   IRAX = col_character(),
##   ROOMSQ = col_character(),
##   INC_HRS1 = col_character(),
##   INC_HRS2 = col_character(),
##   OCCUCOD1 = col_character(),
##   OCCUCOD2 = col_character(),
##   STATE = col_character(),
##   CREDFINX = col_character(),
##   CREDITX = col_character(),
##   BUILDING = col_character(),
##   INT_PHON = col_logical(),
##   INT_HOME = col_logical()
## )
## See spec(...) for full column specifications.
glimpse(ce)
## Observations: 6,301
## Variables: 49
## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619,...
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 4...
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1,...
## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12",...
## $ EDUCA2   <chr> "15", "15", "13", NA, NA, NA, NA, "15", "15", "14", "...
## $ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 5...
## $ AGE2     <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67...
## $ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1...
## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,...
## $ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2...
## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2...
## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4,...
## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1,...
## $ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1,...
## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15",...
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", "."...
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000",...
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1,...
## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4,...
## $ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4,...
## $ ROOMSQ   <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6"...
## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", "...
## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", "."...
## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3,...
## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3,...
## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA,...
## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11...
## $ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "...
## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2...
## $ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1...
## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", "."...
## $ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ CREDITX  <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "60...
## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01",...
## $ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Rebuild the column-type string from the parsed data and compare it to the expected types
ceColTypes <- ""
for (x in names(ce)) { ceColTypes <- paste0(ceColTypes, x, " ", class(ce[, x, drop=TRUE]), " _ ") }
all.equal(colTypes, ceColTypes)
## [1] "1 string mismatch"
# Construct a histogram of the weights
ggplot(data = ce, mapping = aes(x = FINLWT21)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# In the next few exercises we will practice specifying sampling designs using different samples from the api dataset, located in the survey package
# The api dataset contains the Academic Performance Index and demographic information for schools in California
# The apisrs dataset is a simple random sample of schools from the api dataset
# Notice that pw contains the survey weights and fpc contains the total number of schools in the population

data(api, package="survey")
library(survey)
## Loading required package: grid
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
# Look at the apisrs dataset
glimpse(apisrs)
## Observations: 200
## Variables: 39
## $ cds      <chr> "15739081534155", "19642126066716", "30664493030640",...
## $ stype    <fct> H, E, H, E, E, E, M, E, E, E, E, H, M, E, E, E, M, M,...
## $ name     <chr> "McFarland High", "Stowers (Cecil ", "Brea-Olinda Hig...
## $ sname    <chr> "McFarland High", "Stowers (Cecil B.) Elementary", "B...
## $ snum     <dbl> 1039, 1124, 2868, 1273, 4926, 2463, 2031, 1736, 2142,...
## $ dname    <chr> "McFarland Unified", "ABC Unified", "Brea-Olinda Unif...
## $ dnum     <int> 432, 1, 79, 187, 640, 284, 401, 401, 470, 632, 401, 7...
## $ cname    <chr> "Kern", "Los Angeles", "Orange", "Los Angeles", "San ...
## $ cnum     <int> 14, 18, 29, 18, 39, 18, 18, 18, 18, 37, 18, 24, 14, 1...
## $ flag     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ pcttest  <int> 98, 100, 98, 99, 99, 93, 98, 99, 100, 90, 95, 100, 97...
## $ api00    <int> 462, 878, 734, 772, 739, 835, 456, 506, 543, 649, 556...
## $ api99    <int> 448, 831, 742, 657, 719, 822, 472, 474, 458, 604, 575...
## $ target   <int> 18, NA, 3, 7, 4, NA, 16, 16, 17, 10, 11, 9, 14, 5, 15...
## $ growth   <int> 14, 47, -8, 115, 20, 13, -16, 32, 85, 45, -19, 51, 4,...
## $ sch.wide <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, No, Ye...
## $ comp.imp <fct> Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Ye...
## $ both     <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes...
## $ awards   <fct> No, Yes, No, Yes, Yes, No, No, Yes, Yes, No, No, Yes,...
## $ meals    <int> 44, 8, 10, 70, 43, 16, 81, 98, 94, 85, 81, 67, 77, 20...
## $ ell      <int> 31, 25, 10, 25, 12, 19, 40, 65, 65, 57, 4, 25, 32, 16...
## $ yr.rnd   <fct> NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA, NA, NA, N...
## $ mobility <int> 6, 15, 7, 23, 12, 13, 22, 43, 15, 10, 20, 12, 4, 32, ...
## $ acs.k3   <int> NA, 19, NA, 23, 20, 19, NA, 18, 19, 16, 16, NA, NA, 1...
## $ acs.46   <int> NA, 30, NA, NA, 29, 29, 30, 29, 32, 25, 27, NA, NA, 2...
## $ acs.core <int> 24, NA, 28, NA, NA, NA, 27, NA, NA, 30, NA, 17, 27, N...
## $ pct.resp <int> 82, 97, 95, 100, 91, 71, 49, 75, 99, 49, 62, 96, 77, ...
## $ not.hsg  <int> 44, 4, 5, 37, 8, 1, 30, 49, 48, 23, 5, 44, 40, 4, 14,...
## $ hsg      <int> 34, 10, 9, 40, 21, 8, 27, 31, 34, 36, 38, 19, 34, 14,...
## $ some.col <int> 12, 23, 21, 14, 27, 20, 18, 15, 14, 14, 29, 17, 16, 2...
## $ col.grad <int> 7, 43, 41, 8, 34, 38, 22, 2, 4, 21, 24, 19, 8, 37, 10...
## $ grad.sch <int> 3, 21, 24, 1, 10, 34, 2, 3, 1, 6, 5, 2, 2, 19, 1, 3, ...
## $ avg.ed   <dbl> 1.91, 3.66, 3.71, 1.96, 3.17, 3.96, 2.39, 1.79, 1.77,...
## $ full     <int> 71, 90, 83, 85, 100, 75, 72, 69, 68, 81, 84, 100, 89,...
## $ emer     <int> 35, 10, 18, 18, 0, 20, 25, 22, 29, 7, 16, 0, 11, 5, 6...
## $ enroll   <int> 477, 478, 1410, 342, 217, 258, 1274, 566, 645, 311, 3...
## $ api.stu  <int> 429, 420, 1287, 291, 189, 211, 1090, 353, 563, 258, 2...
## $ pw       <dbl> 30.97, 30.97, 30.97, 30.97, 30.97, 30.97, 30.97, 30.9...
## $ fpc      <dbl> 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194,...
# Specify a simple random sampling for apisrs
apisrs_design <- svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)

# Print a summary of the design
summary(apisrs_design)
## Independent Sampling design
## svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03229 0.03229 0.03229 0.03229 0.03229 0.03229 
## Population size (PSUs): 6194 
## Data variables:
##  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
##  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
## [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
## [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
## [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
## [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
## [37] "api.stu"  "pw"       "fpc"
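As a sanity check on the simple random sample, the weights and probabilities in the summary above follow directly from the sample and population sizes (200 schools out of 6194).

```r
# Each school's weight is the population size over the sample size,
# and its sampling probability is the reciprocal of the weight.
6194 / 200             # 30.97, matching pw
round(200 / 6194, 5)   # 0.03229, matching the probabilities in the summary
```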
# Now let's practice specifying a stratified sampling design, using the dataset apistrat
# The schools are stratified based on the school type stype where E = Elementary, M = Middle, and H = High School
# For each school type, a simple random sample of schools was taken

# Glimpse the data
glimpse(apistrat)
## Observations: 200
## Variables: 39
## $ cds      <chr> "19647336097927", "19647336016018", "19648816021505",...
## $ stype    <fct> E, E, E, E, E, E, E, E, E, E, M, M, H, M, H, E, E, M,...
## $ name     <chr> "Open Magnet: Ce", "Belvedere Eleme", "Altadena Eleme...
## $ sname    <chr> "Open Magnet: Center for Individual (Char", "Belveder...
## $ snum     <dbl> 2077, 1622, 2236, 1921, 6140, 6077, 6071, 904, 4637, ...
## $ dname    <chr> "Los Angeles Unified", "Los Angeles Unified", "Pasade...
## $ dnum     <int> 401, 401, 541, 401, 460, 689, 689, 41, 702, 135, 590,...
## $ cname    <chr> "Los Angeles", "Los Angeles", "Los Angeles", "Los Ang...
## $ cnum     <int> 18, 18, 18, 18, 55, 55, 55, 14, 36, 36, 35, 32, 9, 1,...
## $ flag     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ pcttest  <int> 99, 100, 99, 100, 100, 100, 99, 98, 100, 100, 99, 99,...
## $ api00    <int> 840, 516, 531, 501, 720, 805, 778, 731, 592, 669, 496...
## $ api99    <int> 816, 476, 544, 457, 659, 780, 787, 731, 508, 658, 479...
## $ target   <int> NA, 16, 13, 17, 7, 1, 1, 3, 15, 7, 16, 15, 17, 20, 13...
## $ growth   <int> 24, 40, -13, 44, 61, 25, -9, 0, 84, 11, 17, 6, 7, 3, ...
## $ sch.wide <fct> Yes, Yes, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, N...
## $ comp.imp <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, ...
## $ both     <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, ...
## $ awards   <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, ...
## $ meals    <int> 33, 98, 64, 83, 26, 7, 9, 45, 75, 47, 69, 60, 66, 54,...
## $ ell      <int> 25, 77, 23, 63, 17, 0, 2, 2, 58, 23, 25, 10, 43, 26, ...
## $ yr.rnd   <fct> No, Yes, No, No, No, No, No, No, Yes, No, No, No, No,...
## $ mobility <int> 11, 26, 17, 13, 31, 12, 10, 15, 23, 19, 26, 22, 16, 4...
## $ acs.k3   <int> 20, 19, 20, 17, 20, 19, 19, 19, 20, 18, NA, NA, NA, N...
## $ acs.46   <int> 29, 28, 30, 30, 30, 29, 31, 31, 32, 29, 32, 32, NA, 3...
## $ acs.core <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 30, 32, 27, 2...
## $ pct.resp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 87, 67, 50, 70, 71, ...
## $ not.hsg  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31, 49, 12, 20, 45, ...
## $ hsg      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 34, 20, 33, 20, 36, ...
## $ some.col <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 15, 23, 31, 11, ...
## $ col.grad <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 12, 29, 23, 8, 9...
## $ grad.sch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 3, 6, 0, 0, 11...
## $ avg.ed   <dbl> 3.32, 1.67, 2.34, 1.86, 3.17, 3.64, 3.55, 3.10, 2.17,...
## $ full     <int> 100, 57, 81, 64, 90, 95, 96, 93, 91, 96, 84, 65, 93, ...
## $ emer     <int> 0, 40, 26, 24, 7, 0, 0, 8, 14, 0, 18, 37, 17, 26, 19,...
## $ enroll   <int> 276, 841, 441, 298, 354, 330, 385, 583, 763, 381, 129...
## $ api.stu  <int> 241, 631, 415, 288, 319, 315, 363, 510, 652, 322, 103...
## $ pw       <dbl> 44.21, 44.21, 44.21, 44.21, 44.21, 44.21, 44.21, 44.2...
## $ fpc      <dbl> 4421, 4421, 4421, 4421, 4421, 4421, 4421, 4421, 4421,...
# Summarize strata sample sizes
apistrat %>%
  count(stype)
## # A tibble: 3 x 2
##   stype     n
##   <fct> <int>
## 1 E       100
## 2 H        50
## 3 M        50
# Specify the design
strat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1, strata = ~stype)

# Look at the summary information for the stratified design
summary(strat_design)
## Stratified Independent Sampling design
## svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1, 
##     strata = ~stype)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02262 0.02262 0.03587 0.04014 0.05339 0.06623 
## Stratum Sizes: 
##              E  H  M
## obs        100 50 50
## design.PSU 100 50 50
## actual.PSU 100 50 50
## Population stratum sizes (PSUs): 
##    E    H    M 
## 4421  755 1018 
## Data variables:
##  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
##  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
## [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
## [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
## [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
## [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
## [37] "api.stu"  "pw"       "fpc"
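The stratified probabilities can be verified the same way, stratum by stratum, using the stratum sample sizes (100, 50, 50) and the population stratum sizes (4421, 755, 1018) from the summary above.

```r
# Per-stratum sampling probability = stratum sample size / stratum population size
round(c(E = 100/4421, H = 50/755, M = 50/1018), 5)
# E matches the minimum (0.02262) and H the maximum (0.06623) in the summary
```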
# Now let's practice specifying a cluster sampling design, using the dataset apiclus2
# The schools were clustered based on school districts, dnum
# Within a sampled school district, 5 schools were randomly selected for the sample
# The schools are denoted by snum
# The number of districts is given by fpc1 and the number of schools in the sampled districts is given by fpc2

# Glimpse the data
glimpse(apiclus2)
## Observations: 126
## Variables: 40
## $ cds      <chr> "31667796031017", "55751846054837", "41688746043517",...
## $ stype    <fct> E, E, E, M, E, E, E, E, M, H, E, M, E, E, E, E, H, E,...
## $ name     <chr> "Alta-Dutch Flat", "Tenaya Elementa", "Panorama Eleme...
## $ sname    <chr> "Alta-Dutch Flat Elementary", "Tenaya Elementary", "P...
## $ snum     <dbl> 3269, 5979, 4958, 4957, 4956, 4915, 2548, 2550, 2549,...
## $ dname    <chr> "Alta-Dutch Flat Elem", "Big Oak Flat-Grvlnd Unif", "...
## $ dnum     <int> 15, 63, 83, 83, 83, 117, 132, 132, 132, 152, 152, 152...
## $ cname    <chr> "Placer", "Tuolumne", "San Mateo", "San Mateo", "San ...
## $ cnum     <int> 30, 54, 40, 40, 40, 39, 19, 19, 19, 5, 5, 5, 36, 36, ...
## $ flag     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ pcttest  <int> 100, 100, 98, 100, 98, 100, 100, 100, 100, 96, 98, 10...
## $ api00    <int> 821, 773, 600, 740, 716, 811, 472, 520, 568, 591, 544...
## $ api99    <int> 785, 718, 632, 740, 711, 779, 432, 494, 589, 585, 554...
## $ target   <int> 1, 4, 8, 3, 4, 1, 18, 15, 11, 11, 12, 11, NA, NA, NA,...
## $ growth   <int> 36, 55, -32, 0, 5, 32, 40, 26, -21, 6, -10, 29, 14, 2...
## $ sch.wide <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes...
## $ comp.imp <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes...
## $ both     <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes...
## $ awards   <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes...
## $ meals    <int> 27, 43, 33, 11, 5, 25, 78, 76, 68, 42, 63, 54, 0, 4, ...
## $ ell      <int> 0, 0, 5, 4, 2, 5, 38, 34, 34, 23, 42, 24, 3, 6, 2, 1,...
## $ yr.rnd   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
## $ mobility <int> 14, 12, 9, 8, 6, 19, 13, 13, 15, 4, 15, 15, 24, 19, 1...
## $ acs.k3   <int> 17, 18, 19, NA, 18, 20, 19, 25, NA, NA, 20, NA, 19, 1...
## $ acs.46   <int> 20, 34, 29, 30, 28, 22, NA, 23, 24, NA, NA, 27, 27, 2...
## $ acs.core <int> NA, NA, NA, 24, NA, 31, NA, NA, 25, 21, NA, 18, NA, N...
## $ pct.resp <int> 89, 98, 79, 96, 98, 93, 100, 46, 91, 94, 93, 88, 90, ...
## $ not.hsg  <int> 4, 8, 8, 5, 3, 5, 48, 30, 63, 20, 29, 27, 0, 1, 0, 1,...
## $ hsg      <int> 16, 33, 28, 27, 14, 9, 32, 27, 16, 18, 32, 25, 0, 7, ...
## $ some.col <int> 53, 37, 30, 35, 22, 30, 15, 21, 13, 27, 26, 24, 4, 8,...
## $ col.grad <int> 21, 15, 32, 27, 58, 37, 4, 13, 6, 28, 7, 18, 51, 42, ...
## $ grad.sch <int> 6, 7, 1, 6, 3, 19, 1, 9, 2, 7, 6, 7, 44, 41, 100, 45,...
## $ avg.ed   <dbl> 3.07, 2.79, 2.90, 3.03, 3.44, 3.56, 1.77, 2.42, 1.68,...
## $ full     <int> 100, 100, 100, 82, 100, 94, 96, 86, 75, 100, 100, 97,...
## $ emer     <int> 0, 0, 0, 18, 8, 6, 8, 24, 21, 4, 4, 3, 0, 4, 0, 4, 28...
## $ enroll   <int> 152, 312, 173, 201, 147, 234, 184, 512, 543, 332, 217...
## $ api.stu  <int> 120, 270, 151, 179, 136, 189, 158, 419, 423, 303, 182...
## $ pw       <dbl> 18.925, 18.925, 18.925, 18.925, 18.925, 18.925, 18.92...
## $ fpc1     <dbl> 757, 757, 757, 757, 757, 757, 757, 757, 757, 757, 757...
## $ fpc2     <int> <array[25]>
# Specify the design
apiclus_design <- svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw, fpc = ~fpc1 + fpc2)

# Look at the summary information for the cluster design
summary(apiclus_design)
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw, 
##     fpc = ~fpc1 + fpc2)
## Probabilities:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003669 0.037743 0.052840 0.042390 0.052840 0.052840 
## Population size (PSUs): 757 
## Data variables:
##  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
##  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
## [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
## [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
## [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
## [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
## [37] "api.stu"  "pw"       "fpc1"     "fpc2"
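For the cluster sample, the glimpse above shows a common weight of 18.925, and its reciprocal reappears as the modal probability in the design summary.

```r
# Sampling probability for schools carrying the common weight
round(1 / 18.925, 5)   # 0.05284, the median/3rd-quartile/max in the summary
```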
# Construct histogram of pw
ggplot(data = apisrs, mapping = aes(x = pw)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Computation failed in `stat_bin()`:
## `binwidth` must be positive

# Construct histogram of pw
ggplot(data = apistrat, mapping = aes(x = pw)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Construct histogram of pw
ggplot(data = apiclus2, mapping = aes(x = pw)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

NHANESraw <- read.csv("./RInputFiles/NHANESraw.txt")
NHANESraw <- NHANESraw %>%
    mutate(WTMEC4YR=WTMEC2YR / 2)
names(NHANESraw)[1] <- "SurveyYr" # restore the first column name, which is mangled on import
glimpse(NHANESraw)
## Observations: 20,293
## Variables: 78
## $ SurveyYr            <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_1...
## $ ID                  <int> 51624, 51625, 51626, 51627, 51628, 51629, ...
## $ Gender              <fct> male, male, male, male, female, male, fema...
## $ Age                 <int> 34, 4, 16, 10, 60, 26, 49, 1, 10, 80, 10, ...
## $ AgeMonths           <int> 409, 49, 202, 131, 722, 313, 596, 12, 124,...
## $ Race1               <fct> White, Other, Black, Black, Black, Mexican...
## $ Race3               <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Education           <fct> c_HighSchool, NA, NA, NA, c_HighSchool, b_...
## $ MaritalStatus       <fct> Married, NA, NA, NA, Widowed, Married, Liv...
## $ HHIncome            <fct> 25000-34999, 20000-24999, 45000-54999, 200...
## $ HHIncomeMid         <int> 30000, 22500, 50000, 22500, 12500, 30000, ...
## $ Poverty             <dbl> 1.36, 1.07, 2.27, 0.81, 0.69, 1.01, 1.91, ...
## $ HomeRooms           <int> 6, 9, 5, 6, 6, 4, 5, 5, 7, 4, 5, 5, 7, NA,...
## $ HomeOwn             <fct> Own, Own, Own, Rent, Rent, Rent, Rent, Ren...
## $ Work                <fct> NotWorking, NA, NotWorking, NA, NotWorking...
## $ Weight              <dbl> 87.4, 17.0, 72.3, 39.8, 116.8, 97.6, 86.7,...
## $ Length              <dbl> NA, NA, NA, NA, NA, NA, NA, 75.7, NA, NA, ...
## $ HeadCirc            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Height              <dbl> 164.7, 105.4, 181.3, 147.8, 166.0, 173.0, ...
## $ BMI                 <dbl> 32.22, 15.30, 22.00, 18.22, 42.39, 32.61, ...
## $ BMICatUnder20yrs    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ BMI_WHO             <fct> 30.0_plus, 12.0_18.5, 18.5_to_24.9, 12.0_1...
## $ Pulse               <int> 70, NA, 68, 68, 72, 72, 86, NA, 70, 88, 84...
## $ BPSysAve            <int> 113, NA, 109, 93, 150, 104, 112, NA, 108, ...
## $ BPDiaAve            <int> 85, NA, 59, 41, 68, 49, 75, NA, 53, 43, 45...
## $ BPSys1              <int> 114, NA, 112, 92, 154, 102, 118, NA, 106, ...
## $ BPDia1              <int> 88, NA, 62, 36, 70, 50, 82, NA, 60, 62, 38...
## $ BPSys2              <int> 114, NA, 114, 94, 150, 104, 108, NA, 106, ...
## $ BPDia2              <int> 88, NA, 60, 44, 68, 48, 74, NA, 50, 46, 40...
## $ BPSys3              <int> 112, NA, 104, 92, 150, 104, 116, NA, 110, ...
## $ BPDia3              <int> 82, NA, 58, 38, 68, 50, 76, NA, 56, 40, 50...
## $ Testosterone        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ DirectChol          <dbl> 1.29, NA, 1.55, 1.89, 1.16, 1.16, 1.16, NA...
## $ TotChol             <dbl> 3.49, NA, 4.97, 4.16, 5.22, 4.14, 6.70, NA...
## $ UrineVol1           <int> 352, NA, 281, 139, 30, 202, 77, NA, 39, 12...
## $ UrineFlow1          <dbl> NA, NA, 0.415, 1.078, 0.476, 0.563, 0.094,...
## $ UrineVol2           <int> NA, NA, NA, NA, 246, NA, NA, NA, NA, NA, N...
## $ UrineFlow2          <dbl> NA, NA, NA, NA, 2.51, NA, NA, NA, NA, NA, ...
## $ Diabetes            <fct> No, No, No, No, Yes, No, No, No, No, No, N...
## $ DiabetesAge         <int> NA, NA, NA, NA, 56, NA, NA, NA, NA, NA, NA...
## $ HealthGen           <fct> Good, NA, Vgood, NA, Fair, Good, Good, NA,...
## $ DaysPhysHlthBad     <int> 0, NA, 2, NA, 20, 2, 0, NA, NA, 0, NA, 0, ...
## $ DaysMentHlthBad     <int> 15, NA, 0, NA, 25, 14, 10, NA, NA, 0, NA, ...
## $ LittleInterest      <fct> Most, NA, NA, NA, Most, None, Several, NA,...
## $ Depressed           <fct> Several, NA, NA, NA, Most, Most, Several, ...
## $ nPregnancies        <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, ...
## $ nBabies             <int> NA, NA, NA, NA, 1, NA, 2, NA, NA, NA, NA, ...
## $ Age1stBaby          <int> NA, NA, NA, NA, NA, NA, 27, NA, NA, NA, NA...
## $ SleepHrsNight       <int> 4, NA, 8, NA, 4, 4, 8, NA, NA, 6, NA, 9, N...
## $ SleepTrouble        <fct> Yes, NA, No, NA, No, No, Yes, NA, NA, No, ...
## $ PhysActive          <fct> No, NA, Yes, NA, No, Yes, No, NA, NA, Yes,...
## $ PhyActiveDays       <int> NA, NA, 5, NA, NA, 2, NA, NA, NA, 4, NA, N...
## $ TVHrsDay            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ TVHrsDay.1          <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ TVHrsDayChild       <int> NA, 4, NA, 1, NA, NA, NA, NA, 1, NA, 3, NA...
## $ ComputerHrsDayChild <int> NA, 1, NA, 1, NA, NA, NA, NA, 0, NA, 0, NA...
## $ Alcohol12PlusYr     <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, Yes...
## $ AlcoholDay          <int> NA, NA, NA, NA, NA, 19, 2, NA, NA, 1, NA, ...
## $ AlcoholYear         <int> 0, NA, NA, NA, 0, 48, 20, NA, NA, 52, NA, ...
## $ SmokeNow            <fct> No, NA, NA, NA, Yes, No, Yes, NA, NA, No, ...
## $ Smoke100            <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, Ye...
## $ SmokeAge            <int> 18, NA, NA, NA, 16, 15, 38, NA, NA, 16, NA...
## $ Marijuana           <fct> Yes, NA, NA, NA, NA, Yes, Yes, NA, NA, NA,...
## $ AgeFirstMarij       <int> 17, NA, NA, NA, NA, 10, 18, NA, NA, NA, NA...
## $ RegularMarij        <fct> No, NA, NA, NA, NA, Yes, No, NA, NA, NA, N...
## $ AgeRegMarij         <int> NA, NA, NA, NA, NA, 12, NA, NA, NA, NA, NA...
## $ HardDrugs           <fct> Yes, NA, NA, NA, No, Yes, Yes, NA, NA, NA,...
## $ SexEver             <fct> Yes, NA, NA, NA, Yes, Yes, Yes, NA, NA, NA...
## $ SexAge              <int> 16, NA, NA, NA, 15, 9, 12, NA, NA, NA, NA,...
## $ SexNumPartnLife     <int> 8, NA, NA, NA, 4, 10, 10, NA, NA, NA, NA, ...
## $ SexNumPartYear      <int> 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, N...
## $ SameSex             <fct> No, NA, NA, NA, No, No, Yes, NA, NA, NA, N...
## $ SexOrientation      <fct> Heterosexual, NA, NA, NA, NA, Heterosexual...
## $ WTINT2YR            <dbl> 80100.544, 53901.104, 13953.078, 11664.899...
## $ WTMEC2YR            <dbl> 81528.772, 56995.035, 14509.279, 12041.635...
## $ SDMVPSU             <int> 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, ...
## $ SDMVSTRA            <int> 83, 79, 84, 86, 75, 88, 85, 86, 88, 77, 86...
## $ WTMEC4YR            <dbl> 40764.386, 28497.518, 7254.639, 6020.818, ...
# Create table of average survey weights by race
tab_weights <- NHANESraw %>%
  group_by(Race1) %>%
  summarize(avg_wt = mean(WTMEC4YR))

# Print the table
tab_weights
## # A tibble: 5 x 2
##   Race1    avg_wt
##   <fct>     <dbl>
## 1 Black     8026.
## 2 Hispanic  8579.
## 3 Mexican   8216.
## 4 Other    10116.
## 5 White    26236.
# The two important design variables in NHANESraw are SDMVSTRA, which contains the strata assignment for each unit, and SDMVPSU, which contains the cluster id within a given stratum
# Specify the NHANES design
NHANES_design <- svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, 
                           nest = TRUE, weights = ~WTMEC4YR
                           )

# Print summary of design
summary(NHANES_design)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, 
##     nest = TRUE, weights = ~WTMEC4YR)
## Probabilities:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 8.986e-06 5.664e-05 1.054e-04       Inf 1.721e-04       Inf 
## Stratum Sizes: 
##             75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
## obs        803 785 823 829 696 751 696 724 713 683 592 946 598 647 251 862
## design.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3
## actual.PSU   2   2   2   2   2   2   2   2   2   2   2   3   2   2   2   3
##             91  92  93  94  95  96  97  98  99 100 101 102 103
## obs        998 875 602 688 722 676 608 708 682 700 715 624 296
## design.PSU   3   3   2   2   2   2   2   2   2   2   2   2   2
## actual.PSU   3   3   2   2   2   2   2   2   2   2   2   2   2
## Data variables:
##  [1] "SurveyYr"            "ID"                  "Gender"             
##  [4] "Age"                 "AgeMonths"           "Race1"              
##  [7] "Race3"               "Education"           "MaritalStatus"      
## [10] "HHIncome"            "HHIncomeMid"         "Poverty"            
## [13] "HomeRooms"           "HomeOwn"             "Work"               
## [16] "Weight"              "Length"              "HeadCirc"           
## [19] "Height"              "BMI"                 "BMICatUnder20yrs"   
## [22] "BMI_WHO"             "Pulse"               "BPSysAve"           
## [25] "BPDiaAve"            "BPSys1"              "BPDia1"             
## [28] "BPSys2"              "BPDia2"              "BPSys3"             
## [31] "BPDia3"              "Testosterone"        "DirectChol"         
## [34] "TotChol"             "UrineVol1"           "UrineFlow1"         
## [37] "UrineVol2"           "UrineFlow2"          "Diabetes"           
## [40] "DiabetesAge"         "HealthGen"           "DaysPhysHlthBad"    
## [43] "DaysMentHlthBad"     "LittleInterest"      "Depressed"          
## [46] "nPregnancies"        "nBabies"             "Age1stBaby"         
## [49] "SleepHrsNight"       "SleepTrouble"        "PhysActive"         
## [52] "PhyActiveDays"       "TVHrsDay"            "TVHrsDay.1"         
## [55] "TVHrsDayChild"       "ComputerHrsDayChild" "Alcohol12PlusYr"    
## [58] "AlcoholDay"          "AlcoholYear"         "SmokeNow"           
## [61] "Smoke100"            "SmokeAge"            "Marijuana"          
## [64] "AgeFirstMarij"       "RegularMarij"        "AgeRegMarij"        
## [67] "HardDrugs"           "SexEver"             "SexAge"             
## [70] "SexNumPartnLife"     "SexNumPartYear"      "SameSex"            
## [73] "SexOrientation"      "WTINT2YR"            "WTMEC2YR"           
## [76] "SDMVPSU"             "SDMVSTRA"            "WTMEC4YR"
# Number of clusters
NHANESraw %>%
  summarize(n_clusters = n_distinct(SDMVSTRA, SDMVPSU))
##   n_clusters
## 1         62
# Sample sizes in clusters
NHANESraw %>%
  count(SDMVSTRA, SDMVPSU) 
## # A tibble: 62 x 3
##    SDMVSTRA SDMVPSU     n
##       <int>   <int> <int>
##  1       75       1   379
##  2       75       2   424
##  3       76       1   419
##  4       76       2   366
##  5       77       1   441
##  6       77       2   382
##  7       78       1   378
##  8       78       2   451
##  9       79       1   349
## 10       79       2   347
## # ... with 52 more rows

Chapter 2 - Exploring categorical data

Visualizing categorical variables:

  • Can estimate distributions of race, including both the weighted and unweighted distributions
    • tab_unw <- NHANESraw %>% group_by(Race1) %>% summarize(Freq = n()) %>% mutate(Prop = Freq/sum(Freq)) %>% arrange(desc(Prop))
    • ggplot(data = tab_unw, mapping = aes(x = Race1, y = Prop)) + geom_col() + coord_flip() + scale_x_discrete(limits = tab_unw$Race1) # Labels layer omitted
  • Can compute the weighted frequencies with svytable()
    • tab_w <- svytable(~Race1, design = NHANES_design) %>% as.data.frame() %>% mutate(Prop = Freq/sum(Freq)) %>% arrange(desc(Prop))
    • ggplot(data = tab_w, mapping = aes(x = Race1, y = Prop)) + geom_col() + coord_flip() + scale_x_discrete(limits = tab_w$Race1) # Labels layer omitted

Exploring two categorical variables:

  • Can look at diabetes within the NHANES data, using the svytable() function
    • svytable(~Diabetes, design = NHANES_design)
    • tab_w <- svytable(~Race1 + Diabetes, design = NHANES_design) # Race and Diabetes
    • tab_w <- as.data.frame(tab_w) # converts contingency table to frame
    • ggplot(data = tab_w, mapping = aes(x = Race1, fill = Diabetes, y = Freq)) + geom_col() + coord_flip()
    • ggplot(data = tab_w, mapping = aes(x = Race1, y = Freq, fill = Diabetes)) + geom_col(position = "fill") + coord_flip() # stacked bars to 100%

Inference for categorical variables:

  • Formal tests of association between categorical variables use a design-adjusted chi-squared test
    • svychisq(~Race1 + Diabetes, design = NHANES_design, statistic = "Chisq")
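For intuition only, here is the classic (unweighted) chi-squared test on a toy 2x2 table; svychisq() performs the design-adjusted analogue, correcting the test statistic and its reference distribution for weights, strata, and clustering.

```r
# Toy contingency table (made-up counts, unweighted)
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("G1", "G2"),
                              Outcome = c("Yes", "No")))

# Classic chi-squared test of independence (base R)
res <- chisq.test(tab)
res$p.value < 0.05   # association is detected in the toy table
```

On survey data, applying chisq.test() directly to weighted counts would overstate the effective sample size, which is why the design-based svychisq() is used instead.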

Example code includes:

# Specify the survey design
NHANESraw <- mutate(NHANESraw, WTMEC4YR = .5 * WTMEC2YR)
NHANES_design <- svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, nest = TRUE, weights = ~WTMEC4YR)

# Determine the levels of Depressed
levels(NHANESraw$Depressed)
## [1] "Most"    "None"    "Several"
# Construct a frequency table of Depressed
tab_w <- svytable(~Depressed, design = NHANES_design)

# Determine class of tab_w
class(tab_w)
## [1] "svytable" "xtabs"    "table"
# Display tab_w
tab_w
## Depressed
##      Most      None   Several 
##  12704441 158758609  32732508
# Add proportions to table
tab_w <- tab_w %>%
  as.data.frame() %>%
  mutate(Prop = Freq/sum(Freq))

# Create a barplot
ggplot(data = tab_w, mapping = aes(x = Depressed, y = Prop)) + 
  geom_col()

# Construct and print a frequency table
tab_D <- svytable(~Depressed, design = NHANES_design)
tab_D
## Depressed
##      Most      None   Several 
##  12704441 158758609  32732508
# Construct and print a frequency table
tab_H <- svytable(~HealthGen, design = NHANES_design)
tab_H
## HealthGen
## Excellent      Fair      Good      Poor     Vgood 
##  27659954  31544030  87497585   5668484  77482169
# Construct and print a frequency table
tab_DH <- svytable(~Depressed + HealthGen, design = NHANES_design)
tab_DH
##          HealthGen
## Depressed  Excellent       Fair       Good       Poor      Vgood
##   Most      563613.3  3935505.6  4698948.1  1650509.5  1855864.8
##   None    21327181.6 17690782.8 59920031.9  2324945.0 57487318.5
##   Several  1870620.9  7355104.8 13950468.6  1253819.6  8302494.5
# Add conditional proportions to tab_DH
tab_DH_cond <- tab_DH %>%
    as.data.frame() %>%
    group_by(HealthGen) %>%
    mutate(n_HealthGen = sum(Freq), Prop_Depressed = Freq/n_HealthGen) %>%
    ungroup()

# Print tab_DH_cond
tab_DH_cond
## # A tibble: 15 x 5
##    Depressed HealthGen      Freq n_HealthGen Prop_Depressed
##    <fct>     <fct>         <dbl>       <dbl>          <dbl>
##  1 Most      Excellent   563613.   23761416.         0.0237
##  2 None      Excellent 21327182.   23761416.         0.898 
##  3 Several   Excellent  1870621.   23761416.         0.0787
##  4 Most      Fair       3935506.   28981393.         0.136 
##  5 None      Fair      17690783.   28981393.         0.610 
##  6 Several   Fair       7355105.   28981393.         0.254 
##  7 Most      Good       4698948.   78569449.         0.0598
##  8 None      Good      59920032.   78569449.         0.763 
##  9 Several   Good      13950469.   78569449.         0.178 
## 10 Most      Poor       1650510.    5229274.         0.316 
## 11 None      Poor       2324945.    5229274.         0.445 
## 12 Several   Poor       1253820.    5229274.         0.240 
## 13 Most      Vgood      1855865.   67645678.         0.0274
## 14 None      Vgood     57487318.   67645678.         0.850 
## 15 Several   Vgood      8302494.   67645678.         0.123
# Create a segmented bar graph of the conditional proportions in tab_DH_cond
ggplot(data = tab_DH_cond, mapping = aes(x = HealthGen, y = Prop_Depressed, fill = Depressed)) + 
  geom_col() + 
  coord_flip() 

# We can also estimate counts with svytotal(). The syntax is given by:
# svytotal(x = ~interaction(Var1, Var2), design = design, na.rm = TRUE)
# For each combination of the two variables, we get an estimate of the total and the standard error


# Estimate the totals for combos of Depressed and HealthGen
tab_totals <- svytotal(x = ~interaction(Depressed, HealthGen), design = NHANES_design, na.rm = TRUE)

# Print table of totals
tab_totals
##                                                       total      SE
## interaction(Depressed, HealthGen)Most.Excellent      563613  139689
## interaction(Depressed, HealthGen)None.Excellent    21327182 1556268
## interaction(Depressed, HealthGen)Several.Excellent  1870621  277198
## interaction(Depressed, HealthGen)Most.Fair          3935506  370256
## interaction(Depressed, HealthGen)None.Fair         17690783 1206307
## interaction(Depressed, HealthGen)Several.Fair       7355105  455364
## interaction(Depressed, HealthGen)Most.Good          4698948  501105
## interaction(Depressed, HealthGen)None.Good         59920032 3375068
## interaction(Depressed, HealthGen)Several.Good      13950469  931077
## interaction(Depressed, HealthGen)Most.Poor          1650510  195136
## interaction(Depressed, HealthGen)None.Poor          2324945  251934
## interaction(Depressed, HealthGen)Several.Poor       1253820  168440
## interaction(Depressed, HealthGen)Most.Vgood         1855865  269970
## interaction(Depressed, HealthGen)None.Vgood        57487319 2975806
## interaction(Depressed, HealthGen)Several.Vgood      8302495  687020
# Estimate the means for combos of Depressed and HealthGen
tab_means <- svymean(x = ~interaction(Depressed, HealthGen), design = NHANES_design, na.rm = TRUE)

# Print table of means
tab_means
##                                                         mean     SE
## interaction(Depressed, HealthGen)Most.Excellent    0.0027603 0.0007
## interaction(Depressed, HealthGen)None.Excellent    0.1044492 0.0053
## interaction(Depressed, HealthGen)Several.Excellent 0.0091613 0.0014
## interaction(Depressed, HealthGen)Most.Fair         0.0192740 0.0019
## interaction(Depressed, HealthGen)None.Fair         0.0866400 0.0047
## interaction(Depressed, HealthGen)Several.Fair      0.0360214 0.0026
## interaction(Depressed, HealthGen)Most.Good         0.0230129 0.0023
## interaction(Depressed, HealthGen)None.Good         0.2934563 0.0092
## interaction(Depressed, HealthGen)Several.Good      0.0683220 0.0033
## interaction(Depressed, HealthGen)Most.Poor         0.0080833 0.0010
## interaction(Depressed, HealthGen)None.Poor         0.0113863 0.0013
## interaction(Depressed, HealthGen)Several.Poor      0.0061405 0.0009
## interaction(Depressed, HealthGen)Most.Vgood        0.0090890 0.0013
## interaction(Depressed, HealthGen)None.Vgood        0.2815422 0.0078
## interaction(Depressed, HealthGen)Several.Vgood     0.0406612 0.0028
# Run a chi square test between Depressed and HealthGen
svychisq(~Depressed + HealthGen, design = NHANES_design, statistic = "Chisq")
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~Depressed + HealthGen, design = NHANES_design, statistic = "Chisq")
## X-squared = 1592.7, df = 8, p-value < 2.2e-16
# Construct a contingency table
tab <- svytable(~Education + HomeOwn, design=NHANES_design)

# Add conditional proportion of levels of HomeOwn for each educational level
tab_df <- as.data.frame(tab) %>%
  group_by(Education) %>%
  mutate(n_Education = sum(Freq), Prop_HomeOwn = Freq/n_Education) %>%
  ungroup()

# Create a segmented bar graph
ggplot(data = tab_df, mapping = aes(x=Education, y=Prop_HomeOwn, fill=HomeOwn)) + 
  geom_col() + 
  coord_flip()

# Run a chi square test
svychisq(~Education + HomeOwn, 
    design = NHANES_design, 
    statistic = "Chisq")
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~Education + HomeOwn, design = NHANES_design, statistic = "Chisq")
## X-squared = 531.78, df = 8, p-value = 2.669e-16

Chapter 3 - Exploring quantitative data

Summarizing quantitative data:

  • Can look at the days-of-bad-physical-health variable (DaysPhysHlthBad) and summarize it
    • NHANESraw %>% filter(Age >= 12) %>% select(DaysPhysHlthBad) # just the data
    • svymean(x = ~DaysPhysHlthBad, design = NHANES_design, na.rm = TRUE) # mean number of days feeling in bad health
    • svyquantile(x = ~DaysPhysHlthBad, design = NHANES_design, na.rm = TRUE, quantiles = 0.5) # get the median (quantile 0.5) of the data
  • Can grab summaries by group using svyby with a function FUN provided
    • svyby(formula = ~DaysPhysHlthBad, by = ~SmokeNow, design = NHANES_design, FUN = svymean, na.rm = TRUE, keep.names = FALSE)
    • svyby(formula = ~Age, by = ~SmokeNow, design = NHANES_design, FUN = svymean, na.rm = TRUE, keep.names = FALSE)

Visualizing quantitative data:

  • Can create bar graphs of the means
    • out <- svyby(formula = ~DaysPhysHlthBad, by = ~SmokeNow, design = NHANES_design, FUN = svymean, na.rm = TRUE, keep.names = FALSE)
    • ggplot(data = out, mapping = aes(x = SmokeNow, y = DaysPhysHlthBad)) + geom_col() + labs(y = "Monthly Average Number of Bad Health Days", x = "Smoker?")
    • out <- mutate(out, lower = DaysPhysHlthBad - se, upper = DaysPhysHlthBad + se)
  • Add error bars to the means, and create histograms of the data
    • ggplot(data = out, mapping = aes(x = SmokeNow, y = DaysPhysHlthBad, ymin = lower, ymax = upper)) + geom_col(fill = "lightblue") + geom_errorbar(width = .5) + labs(y = "Monthly Average Number of Bad Health Days", x = "Smoker?")
    • ggplot(data = NHANESraw, mapping = aes(x = DaysPhysHlthBad, weight = WTMEC4YR)) + geom_histogram(binwidth = 1, color = "white") + labs(x = "Number of Bad Health Days in a Month")
  • Create density plots of the data
    • NHANESraw %>% filter(!is.na(DaysPhysHlthBad)) %>% mutate(WTMEC4YR_std = WTMEC4YR/sum(WTMEC4YR)) %>% ggplot(mapping = aes(x = DaysPhysHlthBad, weight = WTMEC4YR_std)) + geom_density(bw = .6, fill = "lightblue") + labs(x = "Number of Bad Health Days in a Month")

Inference for quantitative data:

  • May want to compare means across two groups in the data using a weighted 2-sample t-test
    • The test statistic is the difference in means divided by the SE (standard error)
    • svyttest(formula = DaysPhysHlthBad ~ SmokeNow, design = NHANES_design)

Example code includes:

# Compute the survey-weighted mean
svymean(x = ~SleepHrsNight, design = NHANES_design, na.rm = TRUE)
##                 mean     SE
## SleepHrsNight 6.9292 0.0166
# Compute the survey-weighted mean by Gender
svyby(formula = ~SleepHrsNight, by = ~Gender, design = NHANES_design, 
      FUN = svymean, na.rm = TRUE, keep.names = FALSE
      )
##   Gender SleepHrsNight         se
## 1 female      6.976103 0.02374684
## 2   male      6.879050 0.01953263
# Compute the survey-weighted quantiles
svyquantile(x = ~SleepHrsNight, design = NHANES_design, na.rm = TRUE, 
            quantiles = c(0.01, 0.25, 0.5, 0.75, .99)
            )
##               0.01 0.25 0.5 0.75 0.99
## SleepHrsNight    4    6   7    8   10
# Compute the survey-weighted quantiles by Gender
svyby(formula = ~SleepHrsNight, by = ~Gender, design = NHANES_design, FUN = svyquantile, 
      na.rm = TRUE, quantiles = c(0.5), keep.rows = FALSE, keep.var = FALSE
      )
##        Gender statistic
## female female         7
## male     male         7
# Compute the survey-weighted mean by Gender
out <- svyby(formula = ~SleepHrsNight, by = ~Gender, design = NHANES_design, 
             FUN = svymean, na.rm = TRUE, keep.names = FALSE
             )
             
# Construct a bar plot of average sleep by gender
ggplot(data = out, mapping = aes(x=as.factor(Gender), y=SleepHrsNight)) + 
    geom_col() + 
    labs(y="Average Nightly Sleep")

# Add lower and upper columns to out
out_col <- mutate(out, lower = SleepHrsNight - 2*se, upper = SleepHrsNight + 2*se)

# Construct a bar plot of average sleep by gender with error bars
ggplot(data = out_col, mapping = aes(x = Gender, y = SleepHrsNight, ymin = lower, ymax = upper)) + 
    geom_col(fill = "gold") + 
    labs(y = "Average Nightly Sleep") + 
    geom_errorbar(width = 0.7)  

# Create a histogram with a set binwidth
ggplot(data = NHANESraw, mapping = aes(x=SleepHrsNight, weight=WTMEC4YR)) + 
    geom_histogram(binwidth = 1, color = "white") + 
    labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).

# Create a histogram with a set binwidth
ggplot(data = NHANESraw, mapping = aes(x=SleepHrsNight, weight=WTMEC4YR)) + 
    geom_histogram(binwidth = 0.5, color = "white") + 
    labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).

# Create a histogram with a set binwidth
ggplot(data = NHANESraw, mapping = aes(x=SleepHrsNight, weight=WTMEC4YR)) + 
    geom_histogram(binwidth = 2, color = "white") + 
    labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).

# Density plot of sleep faceted by gender
NHANESraw %>% 
    filter(!is.na(SleepHrsNight), !is.na(Gender)) %>%
    group_by(Gender) %>%
    mutate(WTMEC4YR_std = WTMEC4YR/sum(WTMEC4YR)) %>%
    ggplot(mapping = aes(x = SleepHrsNight, weight = WTMEC4YR_std)) + 
        geom_density(bw = 0.6,  fill = "gold") +
        labs(x = "Hours of Sleep") + 
        facet_wrap(~Gender, labeller = "label_both")

# Run a survey-weighted t-test
svyttest(formula = SleepHrsNight ~ Gender, design = NHANES_design)
## Warning in summary.glm(g): observations with zero weight not used for
## calculating dispersion
## Warning in summary.glm(glm.object): observations with zero weight not used
## for calculating dispersion
## 
##  Design-based t-test
## 
## data:  SleepHrsNight ~ Gender
## t = -3.4077, df = 32, p-value = 0.001785
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -0.15287218 -0.04123256
## sample estimates:
## difference in mean 
##        -0.09705237
# Find means of total cholesterol by whether or not active 
out <- svyby(formula = ~TotChol, by = ~PhysActive, design = NHANES_design, 
             FUN = svymean, na.rm = TRUE, keep.names = FALSE
             )

# Construct a bar plot of means of total cholesterol by whether or not active 
ggplot(data = out, mapping = aes(x=PhysActive, y=TotChol)) + 
    geom_col()

# Run t test for difference in means of total cholesterol by whether or not active
svyttest(formula = TotChol ~ PhysActive, design = NHANES_design)
## 
##  Design-based t-test
## 
## data:  TotChol ~ PhysActive
## t = -3.7936, df = 32, p-value = 0.0006232
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -0.20053677 -0.06390939
## sample estimates:
## difference in mean 
##         -0.1322231

Chapter 4 - Modeling quantitative data

Visualization with scatter plots:

  • Can look at head circumference compared to age (only captured for babies) using a scatterplot
    • babies <- filter(NHANESraw, AgeMonths <= 6) %>% select(AgeMonths, HeadCirc)
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc)) + geom_point()
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc)) + geom_jitter(width = 0.3, height = 0) # width jitter but no height jitter
  • Can use weighting to extrapolate the scatter plot to the entire population
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, size = WTMEC4YR)) + geom_jitter(width = 0.3, height = 0) + guides(size = FALSE)
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, size = WTMEC4YR)) + geom_jitter(width = 0.3, height = 0, alpha = 0.3) + guides(size = FALSE)
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, color = WTMEC4YR)) + geom_jitter(width = 0.3, height = 0) + guides(color = FALSE)
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, alpha = WTMEC4YR)) + geom_jitter(width = 0.3, height = 0) + guides(alpha = FALSE)

Visualizing trends:

  • Survey-weighted lines of best fit can be added using geom_smooth() in ggplot2, with weight supplied as an aesthetic to geom_smooth()
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, alpha = WTMEC4YR)) + geom_jitter(width = 0.3, height = 0) + guides(alpha = FALSE) + geom_smooth(method = "lm", se = FALSE, mapping = aes(weight = WTMEC4YR))
  • Can also graph the best fit trendlines split by a categorical variable
    • babies <- filter(NHANESraw, AgeMonths <= 6) %>% select(AgeMonths, HeadCirc, WTMEC4YR, Gender)
    • ggplot(data = babies, mapping = aes(x = AgeMonths, y = HeadCirc, alpha = WTMEC4YR, color = Gender)) + geom_jitter(width = 0.3, height = 0) + guides(alpha = FALSE) + geom_smooth(method = "lm", se = FALSE, mapping = aes(weight = WTMEC4YR))

Modeling survey data:

  • Can use the regression equations directly to predict values for a new data point
    • mod <- svyglm(HeadCirc ~ AgeMonths, design = NHANES_design)
    • summary(mod)
  • The standard errors are an assessment of the likely errors between the estimated regression line and the true regression line
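A fitted svyglm object can also be used for prediction directly. A minimal sketch, assuming the survey package is loaded and NHANES_design is defined as in the example code (the AgeMonths value of 3 is an arbitrary illustration):

```r
# Fit the survey-weighted regression from the bullets above
mod <- svyglm(HeadCirc ~ AgeMonths, design = NHANES_design)

# Predict head circumference for a hypothetical 3-month-old;
# predict.svyglm returns the estimate along with its standard error
predict(mod, newdata = data.frame(AgeMonths = 3))
```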

More complex modeling:

  • Can extend the simple regression to a multiple regression in a parallel slopes model
    • mod <- svyglm(HeadCirc ~ AgeMonths + Gender, design = NHANES_design)
  • Can also extend the simple regression to a multiple regression with different slopes
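In R formula notation, the different-slopes model is written with an interaction term. A sketch, assuming the same design object as above (HeadCirc ~ AgeMonths by Gender is used here for illustration):

```r
# Parallel-slopes model: one slope for AgeMonths, separate intercepts by Gender
mod_same <- svyglm(HeadCirc ~ AgeMonths + Gender, design = NHANES_design)

# Different-slopes model: AgeMonths * Gender expands to
# AgeMonths + Gender + AgeMonths:Gender, so each gender gets its own slope
mod_diff <- svyglm(HeadCirc ~ AgeMonths * Gender, design = NHANES_design)
summary(mod_diff)
```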

Wrap up:

  • Packages survey, dplyr, and ggplot2
  • Survey fundamentals - clusters, strata, weights, svydesign(), etc.
  • Categorical data, svytable(), svychisq()
  • Quantitative data, svymean(), svytotal(), svyby(), svyquantile(), svyttest()
  • Modeling trends, svyglm()

Example code includes:

# Create dataset with only 20 year olds
NHANES20 <- filter(NHANESraw, Age == 20)

# Construct scatter plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight)) + 
    geom_point(alpha = 0.3) + 
    guides(size = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Construct bubble plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight, size=WTMEC4YR)) + 
    geom_point(alpha = 0.3) + 
    guides(size = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Construct a scatter plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight, color=WTMEC4YR)) + 
    geom_point() + 
    guides(color = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Construct a scatter plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight, alpha=WTMEC4YR)) + 
    geom_point() + 
    guides(alpha = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Add gender to plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight, size=WTMEC4YR, color=Gender)) + 
    geom_point(alpha=0.3) + 
    guides(size = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Add gender to plot
ggplot(data = NHANES20, mapping = aes(x=Height, y=Weight, alpha=WTMEC4YR, color=Gender)) + 
    geom_point() + 
    guides(alpha = FALSE)
## Warning: Removed 4 rows containing missing values (geom_point).

# Bubble plot with linear of best fit
ggplot(data = NHANESraw, mapping = aes(x = Height, y = Weight, size=WTMEC4YR)) + 
  geom_point(alpha = 0.1) + 
  guides(size = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, mapping = aes(weight=WTMEC4YR))
## Warning: Removed 2279 rows containing non-finite values (stat_smooth).
## Warning: Removed 2279 rows containing missing values (geom_point).

# Add quadratic curve and cubic curve
ggplot(data = NHANESraw, mapping = aes(x = Height, y = Weight, size = WTMEC4YR)) + 
  geom_point(alpha = 0.1) + 
  guides(size = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, mapping = aes(weight = WTMEC4YR)) +
  geom_smooth(method = "lm", se = FALSE, mapping = aes(weight = WTMEC4YR), formula = y ~ poly(x, 2), color = "orange") +
  geom_smooth(method = "lm", se = FALSE, mapping = aes(weight = WTMEC4YR), formula = y ~ poly(x, 3), color = "red")
## Warning: Removed 2279 rows containing non-finite values (stat_smooth).
## Warning: Removed 2279 rows containing non-finite values (stat_smooth).

## Warning: Removed 2279 rows containing non-finite values (stat_smooth).
## Warning: Removed 2279 rows containing missing values (geom_point).

# Add survey-weighted trend lines to bubble plot
ggplot(data = NHANES20, mapping = aes(x = Height, y = Weight, size = WTMEC4YR, color = Gender)) + 
  geom_point(alpha = 0.1) + 
  guides(size = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, linetype = 2)
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

# Add non-survey-weighted trend lines
ggplot(data = NHANES20, mapping = aes(x = Height, y = Weight, size = WTMEC4YR, color = Gender)) + 
  geom_point(alpha = 0.1) + 
  guides(size = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, linetype = 2) + 
  geom_smooth(method = "lm", se = FALSE, mapping = aes(weight=WTMEC4YR))
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

# Subset survey design object to only include 20 year olds
NHANES20_design <- subset(NHANES_design, Age == 20)

# Build a linear regression model
mod <- svyglm(Weight ~ Height, design = NHANES20_design)

# Print summary of the model
summary(mod)
## 
## Call:
## svyglm(formula = Weight ~ Height, design = NHANES20_design)
## 
## Survey design:
## subset(NHANES_design, Age == 20)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -67.2571    22.9836  -2.926  0.00674 ** 
## Height        0.8305     0.1368   6.072 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 326.6108)
## 
## Number of Fisher Scoring iterations: 2
# Build a linear regression model same slope
mod1 <- svyglm(Weight ~ Height + Gender, design = NHANES20_design)

# Print summary of the same slope model
summary(mod1)
## 
## Call:
## svyglm(formula = Weight ~ Height + Gender, design = NHANES20_design)
## 
## Survey design:
## subset(NHANES_design, Age == 20)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -53.8665    22.7622  -2.366   0.0254 *  
## Height        0.7434     0.1391   5.346  1.2e-05 ***
## Gendermale    2.7207     3.2471   0.838   0.4095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 325.3881)
## 
## Number of Fisher Scoring iterations: 2
# Build a linear regression model different slopes
mod2 <- svyglm(Weight ~ Height*Gender, design = NHANES20_design)

# Print summary of the different slopes model
summary(mod2)
## 
## Call:
## svyglm(formula = Weight ~ Height * Gender, design = NHANES20_design)
## 
## Survey design:
## subset(NHANES_design, Age == 20)
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          9.5061    21.5357   0.441  0.66257   
## Height               0.3565     0.1269   2.809  0.00932 **
## Gendermale        -131.0884    41.9989  -3.121  0.00438 **
## Height:Gendermale    0.7897     0.2385   3.311  0.00273 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 316.5007)
## 
## Number of Fisher Scoring iterations: 2
# Plot BPDiaAve and BPSysAve by Diabetes and include trend lines
drop_na(NHANESraw, Diabetes) %>% 
    ggplot(mapping = aes(x=BPDiaAve, y=BPSysAve, size=WTMEC4YR, color=Diabetes)) + 
    geom_point(alpha = 0.2) +  
    guides(size = FALSE) + 
    geom_smooth(method="lm", se = FALSE, mapping = aes(weight=WTMEC4YR))
## Warning: Removed 4600 rows containing non-finite values (stat_smooth).
## Warning: Removed 4600 rows containing missing values (geom_point).

# Build simple linear regression model
mod1 <- svyglm(BPSysAve ~ BPDiaAve, design = NHANES_design)

# Build model with different slopes
mod2 <- svyglm(BPSysAve ~ BPDiaAve*Diabetes, design = NHANES_design)

# Summarize models
summary(mod1)
## 
## Call:
## svyglm(formula = BPSysAve ~ BPDiaAve, design = NHANES_design)
## 
## Survey design:
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, 
##     nest = TRUE, weights = ~WTMEC4YR)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 85.74311    1.86920   45.87   <2e-16 ***
## BPDiaAve     0.48150    0.02354   20.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 290.3472)
## 
## Number of Fisher Scoring iterations: 2
summary(mod2)
## 
## Call:
## svyglm(formula = BPSysAve ~ BPDiaAve * Diabetes, design = NHANES_design)
## 
## Survey design:
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU, 
##     nest = TRUE, weights = ~WTMEC4YR)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          83.58652    2.05537  40.667  < 2e-16 ***
## BPDiaAve              0.49964    0.02623  19.047  < 2e-16 ***
## DiabetesYes          25.36616    3.56587   7.114 6.53e-08 ***
## BPDiaAve:DiabetesYes -0.22132    0.05120  -4.323 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 279.1637)
## 
## Number of Fisher Scoring iterations: 2

Inference for Categorical Data

Chapter 1 - Inference for a Single Parameter

General Social Survey:

  • Categorical data take on category labels rather than numeric values; such data are prevalent in the General Social Survey (GSS)
    • Several thousand people are surveyed, with a goal of drawing inferences about the population from the sample
    • Can work with the "gss" data frame (loaded from file in the example code below) using tidyverse tools
  • Can generate an approximate 95% interval using mean +/- 2*SE
  • The bootstrap can be a valuable way to assess the standard errors - calculate the sample statistic within each replicate, and calculate its distribution
    • library(infer)
    • boot <- gss2016 %>% specify(response = happy, success = "HAPPY") %>% generate(reps = 500, type = "bootstrap") %>% calculate(stat = "prop")

CI interpretations:

  • In classical statistical inference, there is assumed to be a fixed but unknown population parameter that is being estimated by way of sampling
  • A 95% CI means that 95% of the intervals formed from random samples would include the true population parameter
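This interpretation can be checked with a quick base-R simulation (not from the course): draw many samples from a population with a known proportion p, and count how often the approximate mean +/- 2*SE interval captures it.

```r
set.seed(1)
p <- 0.4   # true population proportion (chosen arbitrarily)
n <- 200   # sample size for each random sample
covered <- replicate(1000, {
  p_hat <- rbinom(1, n, p) / n           # sample proportion
  se <- sqrt(p_hat * (1 - p_hat) / n)    # approximate standard error
  abs(p - p_hat) <= 2 * se               # did the interval capture p?
})
mean(covered)  # proportion of intervals capturing p; typically close to 0.95
```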

Approximation shortcut:

  • Standard errors tend to increase when the sample size is small or the probability is close to 50%
  • The normal distribution (bell curve) can be a useful approximation for a large sample size - the normal becomes the sampling distribution
    • SE = sqrt( p * (1-p) / n )
    • n * p and n * (1-p) should both be greater than or equal to 10
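The shortcut can be wrapped in a small helper (a hypothetical function, not from the course) that enforces the success-failure condition before returning the interval; with p_hat = 0.38 and n = 50 it reproduces the 95% CI shown in the example code below:

```r
prop_ci_approx <- function(p_hat, n, z = 2) {
  # Success-failure condition: at least 10 expected successes and failures
  stopifnot(n * p_hat >= 10, n * (1 - p_hat) >= 10)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

prop_ci_approx(0.38, 50)  # lower 0.2427, upper 0.5173
```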

Example code includes:

load("./RInputFiles/gss.RData")
glimpse(gss)
## Observations: 50,346
## Variables: 28
## $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ year     <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982,...
## $ age      <fct> 41, 49, 27, 24, 57, 29, 21, 68, 54, 80, 74, 30, 53, 3...
## $ class    <fct> WORKING CLASS, WORKING CLASS, MIDDLE CLASS, MIDDLE CL...
## $ degree   <fct> LT HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL...
## $ sex      <fct> MALE, FEMALE, FEMALE, FEMALE, MALE, MALE, FEMALE, MAL...
## $ marital  <fct> MARRIED, MARRIED, NEVER MARRIED, NEVER MARRIED, NEVER...
## $ race     <fct> WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHIT...
## $ region   <fct> NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, N...
## $ partyid  <fct> "STRONG DEMOCRAT", "STRONG DEMOCRAT", "IND,NEAR DEM",...
## $ happy    <fct> PRETTY HAPPY, NOT TOO HAPPY, VERY HAPPY, PRETTY HAPPY...
## $ grass    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ relig    <fct> CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CAT...
## $ cappun2  <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cappun   <fct> FAVOR, FAVOR, FAVOR, OPPOSE, OPPOSE, FAVOR, OPPOSE, F...
## $ finalter <fct> STAYED SAME, WORSE, BETTER, BETTER, STAYED SAME, BETT...
## $ protest3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ natspac  <fct> ABOUT RIGHT, TOO MUCH, TOO LITTLE, TOO LITTLE, ABOUT ...
## $ natarms  <fct> TOO LITTLE, TOO LITTLE, ABOUT RIGHT, TOO MUCH, TOO LI...
## $ conclerg <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ confed   <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ conpress <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ conjudge <fct> HARDLY ANY, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A ...
## $ consci   <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A G...
## $ conlegis <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ zodiac   <fct> TAURUS, CAPRICORN, VIRGO, PISCES, CAPRICORN, LEO, LIB...
## $ oversamp <dbl> 1.235, 1.235, 1.235, 1.235, 1.235, 1.235, 1.235, 1.23...
## $ postlife <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Subset data from 2016
gss2016 <- gss %>%
  filter(year == 2016)

gss2016 %>% count(consci)
## Warning: Factor `consci` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 4 x 2
##   consci           n
##   <fct>        <int>
## 1 A GREAT DEAL   791
## 2 ONLY SOME      976
## 3 HARDLY ANY     117
## 4 <NA>           983
gss2016 <- gss2016 %>%
    mutate(old_consci=consci, 
           consci=fct_other(fct_recode(old_consci, "High"="A GREAT DEAL"), keep="High", other_level="Low")
           )
gss2016 %>% count(consci)
## Warning: Factor `consci` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   consci     n
##   <fct>  <int>
## 1 High     791
## 2 Low     1093
## 3 <NA>     983
# Plot distribution of consci
ggplot(gss2016, aes(x = consci)) +
  geom_bar()

# Compute proportion of high conf
p_hat <- gss2016 %>%
  summarize(p = mean(consci == "High", na.rm = TRUE)) %>%
  pull()


# Load the infer package
library(infer)

# Create single bootstrap data set
b1 <- gss2016 %>%
    specify(response = consci, success = "High") %>%
    generate(reps = 1, type = "bootstrap")
## Warning: Removed 983 rows containing missing values.
# Plot distribution of consci
ggplot(b1, aes(x = consci)) +
  geom_bar()

# Compute proportion with high conf
b1 %>%
  summarize(p = mean(consci == "High")) %>%
  pull()
## [1] 0.4187898
# Create bootstrap distribution for proportion that favor
boot_dist <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500) %>%
  calculate(stat = "prop", success = "High", na.rm = TRUE)
## Warning: Removed 983 rows containing missing values.
## Setting `type = "bootstrap"` in `generate()`.
# Plot distribution
ggplot(boot_dist, aes(x=stat)) +
  geom_density()

# Compute estimate of SE
SE <- boot_dist %>%
  summarize(se = sd(stat)) %>%
  pull()

# Create CI
c(p_hat - 2*SE, p_hat + 2*SE)
## [1] 0.3964511 0.4432517
# Two new smaller data sets have been created for you from gss2016: gss2016_small, which contains 50 observations, and gss2016_smaller which contains just 10 observations

id50 <- c(6, 98, 2673, 1435, 1535, 525, 2784, 1765, 163, 1859, 2497, 1780, 184, 575, 2781, 2310, 1677, 2478, 1226, 2350, 1139, 1635, 1350, 1809, 1842, 1501, 1502, 2610, 2456, 49, 56, 2167, 2401, 2002, 2343, 2012, 860, 2557, 1147, 1119, 2449, 695, 1511, 666, 1595, 1094, 2643, 769, 1263, 2426)
id10 <- c(1609, 1342, 2066, 2710, 1809, 503, 1889, 486, 1469, 6)

gss2016_small <- gss2016 %>%
    filter(id %in% id50)
gss2016_smaller <- gss2016 %>%
    filter(id %in% id10)

# Create bootstrap distribution for proportion
boot_dist_small <- gss2016_small %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")

# Compute estimate of SE
SE_small_n <- boot_dist_small %>%
  summarize(se = sd(stat)) %>%
  pull()

# Create bootstrap distribution for proportion
boot_dist_smaller <- gss2016_smaller %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")

# Compute estimate of SE
SE_smaller_n <- boot_dist_smaller %>%
  summarize(se = sd(stat)) %>%
  pull()

c(SE_small_n, SE_smaller_n)
## [1] 0.07206823 0.14608464
# Create bootstrap distribution for proportion that have hardy any
boot_dist <- gss2016 %>%
  specify(response=consci,  success = "Low") %>%
  generate(reps=500, type="bootstrap") %>%
  calculate(stat = "prop", na.rm = TRUE)
## Warning: Removed 983 rows containing missing values.
# Compute estimate of SE
SE_low_p <- boot_dist %>%
    summarize(se = sd(stat)) %>%
    pull()


# Compute p-hat and n
p_hat <- gss2016_small %>% 
    summarize(p = mean(consci == "High", na.rm=TRUE)) %>%
    pull()
n <- nrow(gss2016_small)

# Check conditions
p_hat * n >= 10
## [1] TRUE
(1 - p_hat) * n >= 10
## [1] TRUE
# Calculate SE
SE_approx <- sqrt(p_hat * (1 - p_hat) / n)

# Form 95% CI
c(p_hat - 2 * SE_approx, p_hat + 2 * SE_approx)
## [1] 0.242712 0.517288

Chapter 2 - Proportions (Testing and Power)

Hypothesis test for a proportion:

  • The hypothesis test for a proportion looks at what sort of p-hat would be observed if p held a specific value
    • The hypothesize() function prior to generate() sets out the hypothesis in question
  • Suppose that analysis is being run on whether people favor capital punishment
    • null <- gss2016 %>% specify(response = cappun, success = "FAVOR") %>% hypothesize(null = "point", p = 0.5) %>% generate(reps = 500, type = "simulate") %>% calculate(stat = "prop")
    • null %>% summarize(mean(stat > p_hat)) %>% pull() * 2 # The times 2 is for a two-sided test

Intervals for differences:

  • Can also look at differences in proportions, for example men's vs. women's belief in the afterlife
  • Can generate null data by rewording the null hypothesis as "there is no association between belief in the afterlife and gender" - this enables testing by permutation
    • gss2016 %>% specify(response=postlife, explanatory=sex, success="YES") %>% hypothesize(null="independence") %>% generate(reps=1, type="permute")
    • gss2016 %>% specify(postlife ~ sex, success="YES") %>% hypothesize(null="independence") %>% generate(reps=1, type="permute") # formula notation; same command as above, but simplified
    • null <- gss2016 %>% specify(postlife ~ sex, success="YES") %>% hypothesize(null="independence") %>% generate(reps=500, type="permute") %>% calculate(stat="diff in props", order=c("FEMALE", "MALE")) # Full command
    • null %>% summarize(mean(stat > d_hat)) %>% pull() * 2

Statistical errors:

  • Type I error - rejecting a true null hypothesis - occurs with probability alpha (the significance level)
  • Type II error - failing to reject a false null hypothesis - occurs with probability beta, so the test has power 1 - beta
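
The Type I error rate can be seen directly in a small simulation (an illustrative sketch, not course code): draw many samples from a population where the null p = 0.5 is true and count how often a two-sided z-test at alpha = 0.05 rejects.

set.seed(1)
alpha <- 0.05
p0 <- 0.5
n <- 200

# Rejection indicator for each simulated sample drawn under H0
reject <- replicate(2000, {
    p_hat <- mean(rbinom(n, 1, p0))
    se <- sqrt(p0 * (1 - p0) / n)
    abs(p_hat - p0) / se > qnorm(1 - alpha / 2)
})

# Proportion of Type I errors - should be close to alpha
mean(reject)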

Example code includes:

# Construct plot
ggplot(gss2016, aes(x = postlife)) + 
    geom_bar()

# Compute and save proportion that believe
p_hat <- gss2016 %>%
    summarize(mean(postlife == "YES", na.rm = TRUE)) %>%
    pull()

# Generate one data set under H0
sim1 <- gss2016 %>%
    specify(response = postlife, success = "YES") %>%
    hypothesize(null = "point", p = 0.75) %>%
    generate(reps = 1, type = "simulate")
## Warning: Removed 279 rows containing missing values.
# Construct plot
ggplot(sim1, aes(x=postlife)) +
    geom_bar()

# Compute proportion that believe
sim1 %>%
    summarize(mean(postlife == "YES")) %>%
    pull()
## [1] 0.7472952
# Generate null distribution
null <- gss2016 %>%
    specify(response = postlife, success = "YES") %>%
    hypothesize(null = "point", p = .75) %>%
    generate(reps = 100, type = "simulate") %>%
    calculate(stat = "prop")
## Warning: Removed 279 rows containing missing values.
# Visualize null distribution
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = p_hat, color = "red")

# Compute the two-tailed p-value
null %>%
    summarize(mean(stat > p_hat)) %>%
    pull() * 2
## [1] 0
# Plot distribution
ggplot(gss2016, aes(x = sex, fill = cappun)) +
    geom_bar(position = "fill")

# Compute two proportions
p_hats <- gss2016 %>%
    group_by(sex) %>%
    summarize(mean(cappun == "FAVOR", na.rm = TRUE)) %>%
    pull()

# Compute difference in proportions
d_hat <- diff(p_hats)


# Create null distribution
null <- gss2016 %>%
    specify(cappun ~ sex, success = "FAVOR") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 500, type = "permute") %>%
    calculate(stat = "diff in props", order = c("FEMALE", "MALE"))
## Warning: Removed 172 rows containing missing values.
# Visualize null
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = d_hat, col = "red")

# Compute two-tailed p-value
null %>%
    summarize(mean(stat < d_hat)) %>%
    pull() * 2
## [1] 0
# Create the bootstrap distribution
boot <- gss2016 %>%
    specify(cappun ~ sex, success="FAVOR") %>%
    generate(reps=500, type="bootstrap") %>%
    calculate(stat = "diff in props", order = c("FEMALE", "MALE"))
## Warning: Removed 172 rows containing missing values.
# Compute the standard error
SE <- boot %>%
    summarize(sd(stat)) %>%
    pull()
  
# Form the CI (lower, upper)
c( d_hat - 2*SE, d_hat + 2*SE )
## [1] -0.12636862 -0.05205316
gssmod <- gss2016 %>%
    mutate(coinflip=sample(c("heads", "tails"), size=nrow(.), replace=TRUE))
table(gssmod$coinflip)
## 
## heads tails 
##  1434  1433
# Find difference in props
p_hats <- gssmod %>%
    group_by(coinflip) %>%
    summarize(mean(cappun == "FAVOR", na.rm = TRUE)) %>%
    pull()

# Compute difference in proportions
d_hat <- diff(p_hats)

# Form null distribution
null <- gssmod %>%
    specify(cappun ~ coinflip, success = "FAVOR") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 500, type = "permute") %>%
    calculate(stat = "diff in props", order = c("heads", "tails"))
## Warning: Removed 172 rows containing missing values.
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = d_hat, color = "red")

# Set alpha
alpha <- 0.05

# Find cutoffs
upper <- null %>%
    summarize(quantile(stat, probs = c(1-alpha/2))) %>%
    pull()
lower <- null %>%
    summarize(quantile(stat, probs = alpha/2)) %>%
    pull()
  
# Visualize cutoffs
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = d_hat, color = "red") +
    geom_vline(xintercept = lower, color = "blue") +
    geom_vline(xintercept = upper, color = "blue")

# check if inside cutoffs
d_hat %>%
    between(lower, upper)
## [1] TRUE

Chapter 3 - Comparing Many Parameters (Independence)

Contingency tables:

  • Can look at bivariate relationships, such as political party affiliation vs. opinions on military spending
    • The broom package can help move between tidy data frames and contingency tables while keeping things clean
    • tab <- gss2016 %>% select(natarms, party) %>% table()
    • tab %>% broom::tidy() %>% uncount(Freq) # the count column is named Freq in older broom releases and n in newer ones
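
A self-contained sketch of the round trip with a toy data frame (the column names are invented; base R's as.data.frame() on a table behaves like broom::tidy() here, with the count column named Freq):

library(dplyr)
library(tidyr)

df <- tibble(opinion = c("UP", "UP", "DOWN"),
             party   = c("D", "R", "D"))

# Data frame -> contingency table
tab <- df %>%
    select(opinion, party) %>%
    table()

# Table -> tidy counts -> back to one row per observation
tab %>%
    as.data.frame() %>%
    uncount(Freq)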

Chi-squared test statistic:

  • Can use the Chi-squared statistic to test for dependence between variables
  • Can create a contingency table O of the observed counts and a table E of the counts expected under perfect independence
    • The statistic is the sum of (O-E)^2 / E over all cells
    • Hypothesis tests can then assess how extreme the observed Chi-squared statistic is
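
The by-hand calculation can be checked against chisq.test() on a toy table (correct = FALSE disables the 2x2 continuity correction so the two agree):

# Toy 2x2 table of observed counts
O <- matrix(c(30, 20, 10, 40), nrow = 2)

# Expected counts under independence: row total * column total / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Chi-squared statistic by hand
sum((O - E)^2 / E)

# Same statistic from the built-in test
chisq.test(O, correct = FALSE)$statistic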

Alternative method - chi-squared test statistic:

  • The null distribution of the Chi-squared statistic can be approximated by the Chi-squared distribution, which is specified solely by its degrees of freedom
    • The degrees of freedom are (nRows - 1) * (nCols - 1)
    • pchisq(chi_obs_spac, df=4) gives the probability of a value less than the observed statistic; use 1 - pchisq() for the upper tail (the p-value of interest)
  • Generally, need 5+ expected counts per cell, and only use the approximation for df of 2+ (for df=1, can just compare proportions using the normal distribution)
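
The two upper-tail calculations are equivalent (illustrative values only):

chi_obs <- 7.57   # hypothetical observed statistic

1 - pchisq(chi_obs, df = 4)
pchisq(chi_obs, df = 4, lower.tail = FALSE)   # same value; numerically safer for tiny p-values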

Intervals for chi-squared:

  • Can remove the hypothesize() call and use generate(type="bootstrap") instead, but a Chi-squared statistic has no real meaning in the absence of a null hypothesis
  • It is very unlikely that you would ever see a confidence interval attached to a Chi-squared statistic

Example code includes:

# Exclude "other" party
gss_party <- gss2016 %>%
    mutate(party=fct_collapse(partyid, 
                              "D"=c("STRONG DEMOCRAT", "NOT STR DEMOCRAT"), 
                              "R"=c("NOT STR REPUBLICAN", "STRONG REPUBLICAN"),
                              "I"=c("IND,NEAR DEM", "INDEPENDENT", "IND,NEAR REP"),
                              "O"="OTHER PARTY"
                              )
           ) %>%
    filter(!is.na(party), party != "O") %>%
    droplevels()

# Bar plot of proportions
gss_party %>%
    ggplot(aes(x = party, fill = natspac)) +
    geom_bar(position = "fill")

# Bar plot of counts
gss_party %>%
    ggplot(aes(x=party, fill = natspac)) +
    geom_bar()

# Create table of natspac and party
O <- gss_party %>%
    select(natspac, party) %>%
    table()

# Convert table back to tidy df
O %>%
    broom::tidy() %>%
    uncount(n)
## # A tibble: 1,249 x 2
##    natspac    party
##    <chr>      <chr>
##  1 TOO LITTLE D    
##  2 TOO LITTLE D    
##  3 TOO LITTLE D    
##  4 TOO LITTLE D    
##  5 TOO LITTLE D    
##  6 TOO LITTLE D    
##  7 TOO LITTLE D    
##  8 TOO LITTLE D    
##  9 TOO LITTLE D    
## 10 TOO LITTLE D    
## # ... with 1,239 more rows
# Create one permuted data set
perm_1 <- gss_party %>%
    specify(natarms ~ party) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1, type = "permute")
## Warning: Removed 1412 rows containing missing values.
# Visualize permuted data
ggplot(perm_1, aes(x = party, fill = natarms)) +
    geom_bar()

# Make contingency table
tab <- perm_1 %>%
    ungroup() %>%
    select(natarms, party) %>%
    table()
  
# Compute chi-squared stat
(chi_obs_arms <- chisq.test(tab)$statistic)
## X-squared 
##   1.34665
(chi_obs_spac <- chisq.test(gss_party$natspac, gss_party$party)$statistic)
## X-squared 
##  7.568185
# Create null
null <- gss_party %>%
    specify(natspac ~ party) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 100, type = "permute") %>%
    calculate(stat = "Chisq")
## Warning: Removed 1514 rows containing missing values.
# Visualize H_0 and obs
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = chi_obs_spac, color = "red")

# Create null
null <- gss_party %>%
    specify(natarms ~ party) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 100, type = "permute") %>%
    calculate(stat = "Chisq")
## Warning: Removed 1412 rows containing missing values.
# Visualize H_0 and obs
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = chi_obs_arms, color = "red")

# create bar plot
gss2016 %>%
    ggplot(aes(x = region, fill = happy)) +
    geom_bar(position = "fill") +
    coord_flip()

# create table
tab <- gss2016 %>%
    select(happy, region) %>%
    table()
  
# compute observed statistic
(chi_obs_stat <- chisq.test(tab)$statistic)
## X-squared 
##  12.60899
# generate null distribution
null <- gss2016 %>%
    mutate(happy=fct_other(happy, keep=c("VERY HAPPY"))) %>%
    specify(happy ~ region, success = "VERY HAPPY") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 500, type = "permute") %>%
    calculate(stat = "Chisq")
## Warning: Removed 8 rows containing missing values.
# plot null(s)
ggplot(null, aes(x = stat)) +
    geom_density() +
    geom_vline(xintercept = chi_obs_stat) +
    stat_function(fun = dchisq, args = list(df = (9-1)*(2-1)), color = "blue")

# permutation p-value
null %>% 
    summarize(mean(stat > chi_obs_stat)) %>% 
    pull()
## [1] 0.116
# approximation p-value
1 - pchisq(chi_obs_stat, df = (9-1)*(2-1))
## X-squared 
## 0.1260301

Chapter 4 - Comparing Many Parameters (Goodness of Fit)

Case Study: Election Fraud:

  • Election fraud has many meanings; this course focuses on altered vote totals
  • Benford's Law applies when looking at broad collections of data and considering only the first digit
    • The law predicts that 30.1% of first digits should be 1, with the frequency decreasing as the digit increases
    • The basic intuition is that growing counts pass through leading-1 values first (they reach the 100s before any other x00s)
  • Can look at the 2009 Iranian election and compare it to Benford's Law
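
Benford's first-digit probabilities have a closed form, log10(1 + 1/d), which reproduces the p_benford vector used in the example code later in this chapter:

d <- 1:9
p_benford <- log10(1 + 1/d)   # P(first digit = d)

round(p_benford, 3)           # starts at 0.301 for digit 1
sum(p_benford)                # probabilities sum to 1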

Goodness of Fit:

  • Want to assess whether the voter data aligns well with Benford's Law - the Chi-squared statistic is a good fit for this
    • chisq.test(myTab, p=myProbNull)
  • Can simulate the null hypothesis, for example by using
    • gss2016 %>% specify(response=party) %>% hypothesize(null="point", p=p_uniform) %>% generate(reps=1, type="simulate")
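
A runnable sketch of the goodness-of-fit call (the first-digit counts here are invented for illustration):

# Hypothetical first-digit counts for digits 1-9
tab <- c(110, 60, 46, 35, 28, 24, 21, 19, 17)

# Benford first-digit probabilities
p_benford <- log10(1 + 1/(1:9))

# Test the counts against the Benford null
chisq.test(tab, p = p_benford)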

Now to the US:

  • Comparison to the US election in Iowa in 2016
  • Can look at county-level data

Wrap-Up:

  • Could have rejected the null hypothesis even when it is true - a Type I error, occurring with probability alpha, typically 5%
  • More fundamental errors could be at play, such as wrongly assuming the first digits should follow Benford's Law
    • Populations of world cities tend to meet Benford's Law criteria (values spanning several orders of magnitude, roughly uniform on the log scale, etc.)
  • Techniques for carrying out inference on categorical data: confidence intervals, hypothesis tests, Chi-squared tests for independence, goodness of fit of distributions
  • All tests follow specify-hypothesize-generate-calculate

Example code includes:

iran <- readr::read_csv("./RInputFiles/iran.csv")
## Parsed with column specification:
## cols(
##   province = col_character(),
##   city = col_character(),
##   ahmadinejad = col_double(),
##   rezai = col_double(),
##   karrubi = col_double(),
##   mousavi = col_double(),
##   total_votes_cast = col_double(),
##   voided_votes = col_double(),
##   legitimate_votes = col_double()
## )
glimpse(iran)
## Observations: 366
## Variables: 9
## $ province         <chr> "East Azerbaijan", "East Azerbaijan", "East A...
## $ city             <chr> "Azar Shahr", "Asko", "Ahar", "Bostan Abad", ...
## $ ahmadinejad      <dbl> 37203, 32510, 47938, 38610, 36395, 435728, 20...
## $ rezai            <dbl> 453, 481, 568, 281, 485, 9830, 166, 55, 442, ...
## $ karrubi          <dbl> 138, 468, 173, 53, 190, 3513, 74, 46, 211, 12...
## $ mousavi          <dbl> 18312, 18799, 26220, 12603, 33695, 419983, 14...
## $ total_votes_cast <dbl> 56712, 52643, 75500, 51911, 71389, 876919, 35...
## $ voided_votes     <dbl> 606, 385, 601, 364, 624, 7865, 195, 102, 634,...
## $ legitimate_votes <dbl> 56106, 52258, 74899, 51547, 70765, 869054, 35...
# Compute candidate totals
totals <- iran %>%
  summarize(ahmadinejad = sum(ahmadinejad),
            rezai = sum(rezai),
            karrubi = sum(karrubi),
            mousavi = sum(mousavi))

# Plot totals
totals %>%
  gather(key = "candidate", value = "votes") %>%
  ggplot(aes(x = candidate, y = votes)) +
  geom_bar(stat = "identity")

# Cities won by #2
iran %>%
  group_by(province) %>%
  summarize(ahmadinejad = sum(ahmadinejad),
            mousavi = sum(mousavi)) %>%
  mutate(mousavi_win = mousavi > ahmadinejad) %>%
  filter(mousavi_win)
## # A tibble: 2 x 4
##   province               ahmadinejad mousavi mousavi_win
##   <chr>                        <dbl>   <dbl> <lgl>      
## 1 Sistan and Baluchestan      450269  507946 TRUE       
## 2 West Azerbaijan             623946  656508 TRUE
# Print get_first
get_first <- function(x) {
    substr(as.character(x), 1, 1) %>%
      as.numeric() %>%
      as.factor()
}

# Create first_digit
iran2 <- iran %>%
  mutate(first_digit = get_first(total_votes_cast))
  
# Construct barchart
iran2 %>%
  ggplot(aes(x=first_digit)) +
  geom_bar()

# Tabulate the counts of each digit
tab <- iran2 %>%
  select(first_digit) %>%
  table()

# Compute observed stat
p_benford <- c(0.301029995663981, 0.176091259055681, 0.1249387366083, 0.0969100130080564, 0.0791812460476248, 0.0669467896306132, 0.0579919469776867, 0.0511525224473813, 0.0457574905606751)
names(p_benford) <- 1:9
p_benford[9] <- 1 - sum(p_benford[-9])
sum(p_benford)
## [1] 1
chi_obs_stat <- chisq.test(tab, p = p_benford)$stat

# Form null distribution
null <- iran2 %>%
  specify(response=first_digit) %>%
  hypothesize(null = "point", p = p_benford) %>%
  generate(reps=500, type = "simulate") %>%
  calculate(stat = "Chisq")


# plot both nulls
ggplot(null, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = chi_obs_stat) + 
  stat_function(fun = dchisq, args = list(df = 9-1), color = "blue")

# permutation p-value
null %>%
  summarize(mean(stat > chi_obs_stat)) %>%
  pull()
## [1] 0.006
# approximation p-value
pchisq(chi_obs_stat, df=9-1, lower.tail=FALSE)
##   X-squared 
## 0.006836367
iowa <- readr::read_csv("./RInputFiles/iowa.csv")
## Parsed with column specification:
## cols(
##   office = col_character(),
##   candidate = col_character(),
##   party = col_character(),
##   county = col_character(),
##   votes = col_double()
## )
glimpse(iowa)
## Observations: 1,386
## Variables: 5
## $ office    <chr> "President/Vice President", "President/Vice Presiden...
## $ candidate <chr> "Evan McMullin / Nathan Johnson", "Under Votes", "Ga...
## $ party     <chr> "Nominated by Petition", NA, "Libertarian", NA, "Soc...
## $ county    <chr> "Adair", "Adair", "Adair", "Adair", "Adair", "Adair"...
## $ votes     <dbl> 10, 32, 127, 5, 0, 10, 1133, 14, 3, 2461, 3848, 38, ...
# Get R+D county totals
iowa2 <- iowa %>%
  filter(candidate == "Hillary Clinton / Tim Kaine" | candidate == "Donald Trump / Mike Pence") %>%
  group_by(county) %>%
  summarize(dem_rep_votes = sum(votes, na.rm = TRUE)) 

# Add first_digit
iowa3 <- iowa2 %>%
  mutate(first_digit = get_first(dem_rep_votes))

# Construct bar plot
iowa3 %>%
  ggplot(aes(x=first_digit)) + 
  geom_bar()

# Tabulate the counts of each digit
tab <- iowa3 %>%
  select(first_digit) %>%
  table()

# Compute observed stat
chi_obs_stat <- chisq.test(tab, p = p_benford)$stat
## Warning in chisq.test(tab, p = p_benford): Chi-squared approximation may be
## incorrect
# Form null distribution
null <- iowa3 %>%
  specify(response = first_digit) %>%
  hypothesize(null = "point", p = p_benford) %>%
  generate(reps = 500, type = "simulate") %>%
  calculate(stat = "Chisq")
## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect
## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect

## Warning in stats::chisq.test(table(first_digit)[p_levels], p = attr(x,
## "params")): Chi-squared approximation may be incorrect
# Visualize null
ggplot(null, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = chi_obs_stat)


Building Dashboards with flexdashboard

Chapter 1 - Dashboard Layouts

Introduction:

  • Dashboards are a collection of components in a single display - graphs, text, tables, widgets, etc.
  • The flexdashboard is an R package that allows for using R Markdown to create a dashboard
    • Can include all the power of R
    • Can combine with Shiny for reactive elements
  • Course will include capabilities of flexdashboard, decision as to whether to incorporate Shiny, and potential extensions

Anatomy of flexdashboard:

  • Within R Markdown, the header controls the type of document created during knitting
    • output: flexdashboard::flex_dashboard (will create the flexdashboard output)
  • The flexdashboard is made up of charts, with each chart denoted by ### ChartName
    • The succeeding lines can then be R code, similar to other R Markdown processes
  • By default, all of the charts will stack in a single column, though multiple columns can also be declared
    • Columns are created using 14+ dashes, with everything underneath contained in that column
    • Can give a specific name and specify options for each of the columns
  • Can start in Rstudio using File - New File - R Markdown - From Template - flexdashboard
  • Course data will include bicycle sharing data from San Francisco

Layout basics:

  • Columns can be of variable width by using data-width= such that they add up to 1000
  • Can create by rows rather than columns using orientation: rows underneath flexdashboard::flex_dashboard:
    • Can use data-height to vary the row heights
    • Can use vertical_layout: scroll as an option to allow for scrolling rather than forcing everything on to one page (this is considered poor dashboard design, though)

Advanced layouts:

  • Options for extending the dashboard include tabsets
    • Column {.tabset} - will apply the tabset to every chart in that column
  • Can also extend by using pages, where columns and rows are children of their respective pages
    • Sixteen (16) or more equal signs under a Page title specify a call to the page
    • Can add Page xxx {data-navmenu=yyy} to specify that the page xxx should belong to the navmenu yyy

Example code includes (not added due to need for separate dashboard file):
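Although the dashboard itself lives in a separate .Rmd file, a minimal skeleton tying these layout ideas together might look like the following (chart names and data-width values are placeholders; R chunks with plotting code would follow each ### header):

````markdown
---
title: "Bike Sharing Dashboard"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
---

Column {data-width=650}
--------------------------------------------------

### Chart A

Column {data-width=350 .tabset}
--------------------------------------------------

### Chart B

### Chart C
````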


Chapter 2 - Data Visualization for Dashboards

Graphs:

  • The easiest way to add a graph is to include it as part of a code snippet in flexdashboard
  • Graphs will flex and resize to stay in the container
  • Sometimes, may want to set the figure width and height to match the aspect ratio of the target device as closely as possible
    • {r, fig.width=10, fig.height=5}
    • Downside #1 - trial and error needed
    • Downside #2 - need to adjust every time charts are added
    • Downside #3 - graphs are no longer responsive to user inputs

Web-Friendly Visualizations:

  • Web-friendly packages include plotly, highcharter, dygraphs, rbokeh, ggvis
  • The plotly calls are helpful since they are closely linked to ggplot2
    • library(plotly)
    • ggplotly(my_ggplot) # my_ggplot is a ggplot2 object

htmlwidgets:

  • htmlwidgets are a framework that connects R with Javascript (web-friendly and well-suited to dashboards)
  • The leaflet package allows for adding interactive maps
    • library(leaflet)
    • leaflet() %>% addTiles() %>% addMarkers(lng = data_df$longitude, lat = data_df$latitude)
    • leaflet(data_df) %>% addTiles() %>% addMarkers() # leaflet called on a data frame

Example code includes (not added due to need for separate dashboard file):
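A short sketch of the ggplotly and leaflet calls above (the stations_df data frame and its column names are hypothetical):

```r
library(ggplot2)
library(plotly)
library(leaflet)

# Hypothetical station data
stations_df <- data.frame(
  name      = c("Market St", "Embarcadero"),
  longitude = c(-122.3977, -122.3927),
  latitude  = c(37.7935, 37.7948),
  trips     = c(120, 85)
)

# Build a ggplot2 object, then convert it to an interactive plotly widget
p <- ggplot(stations_df, aes(x = name, y = trips)) + geom_col()
ggplotly(p)

# Interactive map; the ~column formulas refer to columns of stations_df
leaflet(stations_df) %>%
  addTiles() %>%
  addMarkers(lng = ~longitude, lat = ~latitude, popup = ~name)
```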


Chapter 3 - Dashboard Components

Highlighting Single Values:

  • Gauges can be helpful for values in a defined range, such as 0%-100%
    • gauge(value = pct_subscriber_trips, min = 0, max = 100) # basics for creating a gauge include the value, the min, and the max
    • gauge(value = pct_subscriber_trips, min = 0, max = 100, sectors = gaugeSectors(success = c(90, 100), warning = c(70, 89), danger = c(0, 69)), symbol = "%") # additional features
  • Value boxes can be helpful for values that do not fall in a pre-defined range
    • valueBox(prettyNum(num_trips, big.mark = ","), caption = "Total Daily Trips", icon = "fa-bicycle") # font-awesome bicycle icon
  • Both gauges and value boxes can be linked
    • valueBox(prettyNum(num_trips, big.mark = ","), caption = "Total Daily Trips", icon = "fa-bicycle", href = "#trip-raw-data") # href makes the caption linked and clickable

Dashboard Tables:

  • The kable function from knitr is one of the easiest ways to create a table - but, not very well tuned to html
    • library(knitr)
    • kable(my_data_df)
  • The DT package is better suited to making responsive tables
    • library(DT)
    • datatable(my_data_df)
    • datatable(my_data_df, rownames = FALSE) # eliminate row numbering
    • datatable(my_data_df, rownames = FALSE, options = list(pageLength = 15)) # most options are set by way of a list; note the contrast to rownames
    • datatable(my_data_df, rownames = FALSE, extensions = "Buttons", options = list(dom = "Bfrtip", buttons = c("copy", "csv", "excel", "pdf", "print"))) # buttons for data export

Text for Dashboards:

  • Captions (notes) are a common way to add text to a chart, and are added using a greater than sign with text; there must be an empty line between the end of the chunk and the caption
  • Another way to provide more context is with the storyboard format
    • Presents one chart at a time in a specified order, where the user controls the navigation speed between the charts
    • Good format for content that runs in order
    • Requires storyboard: true in the yaml header in the flexdashboard::flex_dashboard items
    • Within the story, the ### signal the next page of the story, and should have descriptive text
    • Can add commentary using the triple asterisk (***) which needs to be AFTER the R chunk and separated by at least one space
  • Can also mix in storyboard on some pages but not on others (requires leaving this out of the yaml header)
    • Add {.storyboard} to the end of the page description

Example code includes (not added due to need for separate dashboard file):
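The component calls above can be sketched together as follows (pct_subscriber_trips and num_trips are hypothetical values, and mtcars stands in for dashboard data):

```r
library(flexdashboard)  # gauge(), valueBox()
library(DT)             # datatable()

pct_subscriber_trips <- 84   # hypothetical value
num_trips <- 12345           # hypothetical value

# Gauge with colored sectors and a percent symbol
gauge(pct_subscriber_trips, min = 0, max = 100,
      sectors = gaugeSectors(success = c(90, 100),
                             warning = c(70, 89),
                             danger  = c(0, 69)),
      symbol = "%")

# Linked value box with a font-awesome icon
valueBox(prettyNum(num_trips, big.mark = ","),
         caption = "Total Daily Trips",
         icon = "fa-bicycle",
         href = "#trip-raw-data")

# Responsive table with export buttons
datatable(mtcars, rownames = FALSE,
          extensions = "Buttons",
          options = list(dom = "Bfrtip",
                         buttons = c("copy", "csv", "excel", "pdf", "print")))
```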


Chapter 4 - Adding Interactivity with Shiny

Incorporating Shiny into Dashboards:

  • Incorporating Shiny is optional but can make the dashboards even more interactive
    • Shiny is interactive and lightweight, though at the expense of greater complication and hosting challenges
  • Even after incorporating Shiny, the flexdashboard document is still an interactive R Markdown document
  • runtime: shiny in the yaml header will make the flexdashboard into a Shiny app

Reactive Dataframe Pattern:

  • Creating a narrow sidebar using
    • Column {data-width=200 .sidebar}
  • Widgets can be used inside of an R chunk, like any other dashboard component
    • sliderInput("duration_slider", label = "Select maximum trip duration to display (in minutes):", min = 0, max = 120, value = 15, step = 5, dragRange = TRUE)
    • show_trips_df <- reactive({ trips_df %>% filter(duration_sec <= input$duration_slider * 60) })
    • To call the reactive data frame later, use show_trips_df()
  • Output from the reactive needs to be enclosed in the appropriate render*() function
    • renderLeaflet({ show_trips_df() %>% . %>% leaflet() %>% . }) # no need for the output call like there would be in a typical Shiny document
  • Five key steps for the reactive data frame pattern
    • Create a sidebar column
    • Add user inputs to the sidebar - *Input() Shiny widgets
    • Make a data frame that uses the inputs, called later using ()
    • Replace the dataframe in the dashboard component code with the reactive version
    • Wrap with the appropriate rendering function render*()
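The five steps above can be sketched as one sequence (trips_df and its column names are assumed to exist in the dashboard's setup chunk):

```r
library(shiny)
library(dplyr)
library(leaflet)

# Step 2: user input widget (lives in the sidebar column)
sliderInput("duration_slider",
            label = "Select maximum trip duration to display (in minutes):",
            min = 0, max = 120, value = 15, step = 5)

# Step 3: reactive data frame built from the input
show_trips_df <- reactive({
  trips_df %>%
    filter(duration_sec <= input$duration_slider * 60)
})

# Steps 4-5: call the reactive with (), wrapped in the matching render*()
# (start_longitude / start_latitude are hypothetical column names)
renderLeaflet({
  show_trips_df() %>%
    leaflet() %>%
    addTiles() %>%
    addMarkers(lng = ~start_longitude, lat = ~start_latitude)
})
```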

Customized Inputs for Charts:

  • Can have a reactive component impact everything, as per the example worked through in the previous section
  • May also want to have sliders that only impact a single object
    • Putting these together in the same location can cause headaches due to the need to work with layouts
  • Example code to implement includes
    • fillCol(height = 600, flex = c(NA, 1), inputPanel(sliderInput("my_input", .)), plotOutput("my_plot", height = "100%")) # flex is the flexible height for the components
    • output$my_plot <- renderPlot({ . })
  • Can use a global shortcut
    • Can put all of the charts that are driven by the same slider on to the same page
    • Can put the sidebar as a class with its own page, followed by all the other pages, to have the same sidebar drive all of the pages

Wrap-up:

  • Additional resources available through Rstudio and htmlwidgets.org (information about all html widgets available in R)
    • The highcharter package can be helpful - high quality charts with some interactivity
  • Can use shinydashboard for just Shiny if R Markdown and flexdashboard are not needed

Example code includes (not added due to need for separate dashboard file):


Network Analysis in R: Case Studies

Chapter 1 - Exploring Graphs Through Time

Exploring Data Set:

  • Daily snapshots of items purchased together (co-purchases) from Amazon in 2003
    • There is the to and from that will make up the graph, plus associated metadata
    • Desire to look only at a single day, and only the from-to data (assuming a directed graph)
  • Can look at a smaller subset of the graph, for example
    • sg <- induced_subgraph(amzn_g, 1:500)
    • sg <- delete.vertices(sg, degree(sg) == 0)
    • plot(sg, vertex.label = NA, edge.arrow.width = 0, edge.arrow.size = 0, margin = 0, vertex.size = 2)
  • Can count the number of dyads (2-vertex) and triads (3-vertex) configurations in the data
    • null, asymmetric, and mutual are the potential results for dyads
    • Three-digit codes are used to reflect the 16 potential triad states - counts of mutual, asymmetric, and null dyads - plus an optional letter (D, U, C, or T)

Exploring Temporal Structure:

  • Dataset has 4 days' worth of data - can build from the earliest date to the latest date
  • Can create a list to hold the igraphs at each time period, loop over the times, and then plot them
    • A handful of vertices may be important and interesting across time
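A sketch of the snapshot loop, assuming a hypothetical all_edges_df with from and to columns plus a day index column dayInt:

```r
library(igraph)

# Build one cumulative graph per day (dayInt runs 1 to 4)
graph_list <- vector("list", 4)
for (d in 1:4) {
  day_edges <- all_edges_df[all_edges_df$dayInt <= d, c("from", "to")]
  graph_list[[d]] <- graph_from_data_frame(day_edges, directed = TRUE)
}

# Plot each snapshot in turn
for (d in 1:4) {
  plot(graph_list[[d]], vertex.label = NA, vertex.size = 2,
       edge.arrow.size = 0, main = paste("Through day", d))
}
```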

Example code includes:

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
amzn_g <- read.graph("./RInputFiles/amzn_g.gml", format=c("gml"))
amzn_g
## IGRAPH 5125b42 DN-- 10245 10754 -- 
## + attr: id (v/n), name (v/c)
## + edges from 5125b42 (vertex names):
##  [1] 44  ->42   179 ->71   410 ->730  415 ->741  656 ->1267 669 ->672 
##  [7] 672 ->669  689 ->690  689 ->1284 690 ->689  690 ->1284 730 ->410 
## [13] 741 ->909  786 ->1767 802 ->806  806 ->802  856 ->205  857 ->211 
## [19] 867 ->866  868 ->866  909 ->741  911 ->748  921 ->190  1015->151 
## [25] 1016->1015 1047->1049 1049->1047 1204->1491 1267->656  1272->669 
## [31] 1278->152  1282->943  1284->689  1285->1286 1286->1285 1290->1293
## [37] 1293->1290 1293->1606 1294->1295 1295->1294 1312->730  1350->2783
## [43] 1362->156  1366->190  1438->1580 1438->1581 1467->3996 1479->158 
## + ... omitted several edges
# Perform dyad census
dc <- dyad_census(amzn_g)

# Perform triad census
tc <- triad_census(amzn_g)

# Find the edge density
ed <- edge_density(amzn_g)

# Output values
print(dc)
## $mut
## [1] 3199
## 
## $asym
## [1] 4356
## 
## $null
## [1] 52467335
print(tc)
##  [1] 179089386743     44610360     32763436          215         1906
##  [6]          507         1198          457          118            0
## [11]          301          170          119           33          239
## [16]          288
print(ed)
## [1] 0.0001024681
# Calculate transitivity
transitivity(amzn_g)
## [1] 0.3875752
# Calculate reciprocity
amzn_rp <- reciprocity(amzn_g)

# Simulate our outputs
nv <- gorder(amzn_g)
ed <- edge_density(amzn_g)
rep_sim <- rep(NA, 1000)

# Simulate 
for(i in 1:1000){
  rep_sim[i] <- reciprocity(erdos.renyi.game(nv, ed, "gnp", directed = TRUE))
}

# Compare
quantile(rep_sim, c(0.25, .5, 0.975))
##          25%          50%        97.5% 
## 0.0000000000 0.0000000000 0.0005504297
print(amzn_rp)
## [1] 0.5949414
# Get the distribution of in and out degrees
table(degree(amzn_g, mode = "in"))
## 
##    0    1    2    3    4    5    6    7    8    9   11   12   17 
## 2798 5240 1549  424  139   50   20    7    9    5    1    2    1
table(degree(amzn_g, mode = "out"))
## 
##    0    1    2    3    4    5 
## 1899 6350 1635  313   45    3
# Find important products based on the ratio of out to in and look for extremes
imp_prod <- V(amzn_g)[degree(amzn_g, mode = "out") > 3 & degree(amzn_g, mode = "in") < 3]

# Output the vertices
print(imp_prod)
## + 8/10245 vertices, named, from 5125b42:
## [1] 1629   4545   6334   20181  62482  64344  155513 221085
ipFrom <- c(1629, 1629, 1629, 1629, 1629, 1629, 1629, 1629, 1629, 1629, 1629, 11163, 11163, 11163, 11163, 11163, 11163, 11163, 11163, 11163, 11163, 11163, 32129, 32129, 32129, 32129, 32129, 32129, 32129, 38131, 38131, 38131, 38131, 38131, 38131, 45282, 45282, 45282, 45282, 52831, 52831, 52831, 52831, 52831, 52831, 52831, 52831, 53591, 53591, 53591, 53591, 53591, 53591, 53591, 53591, 56427, 56427, 56427, 56427, 59706, 59706, 59706, 59706, 59706, 59706, 59706, 59706, 62482, 62482, 62482, 62482, 62482, 62482, 67038, 67038, 67038, 67038, 71192, 71192, 71192, 71192, 71192, 77957, 77957, 77957, 77957, 77957, 77957, 103733, 103733, 103733, 103733, 103733, 117841, 117841, 117841, 117841, 117841, 117841, 117841, 117841, 117841, 117841, 123808, 123808, 123808, 123808, 123808, 123808, 123808, 123808, 123808, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 132757, 144749, 144749, 144749, 144749, 144749, 144749, 144749, 170830, 170830, 170830, 170830, 170830, 170830, 177282, 177282, 177282, 177282, 177282, 177282, 177432, 177432, 177432, 177432, 177432, 177432, 177432, 184526, 184526, 184526, 184526, 184526, 191825, 191825, 191825, 191825, 191825, 215668, 215668, 215668, 221085, 221085, 221085, 221085, 221085, 231604, 231604, 231604, 231604, 231604, 231604, 239014, 239014, 239014, 239014, 239014, 242693, 242693, 242693, 242693, 242693, 257621, 257621, 257621, 257621, 261587, 261587, 261587, 261587, 261587, 261587, 261657, 261657, 261657, 261657, 261657, 261657)
ipTo <- c(190, 1366, 2679, 4023, 1625, 1627, 7529, 1272, 1628, 1630, 1631, 11124, 15360, 20175, 10626, 20970, 10776, 11164, 11166, 5955, 8719, 11164, 23842, 23843, 24115, 15312, 23329, 32127, 80473, 44848, 44849, 44850, 38133, 31084, 33711, 10920, 20178, 20179, 87093, 2134, 2136, 4119, 9995, 36524, 64698, 64700, 52833, 120083, 120085, 120086, 36689, 12340, 113789, 32094, 51015, 1898, 10076, 15800, 61488, 63836, 63837, 63838, 8882, 59708, 59711, 26982, 59708, 69497, 69498, 69499, 69500, 23349, 62480, 58926, 58928, 64118, 52271, 71190, 71380, 75384, 9762, 57876, 43543, 43546, 98488, 77951, 77953, 116842, 103732, 103734, 103735, 103728, 124733, 117842, 117843, 117845, 117842, 117843, 117845, 117842, 117842, 117843, 117845, 59267, 89503, 89506, 156, 190, 105428, 184973, 195785, 195787, 132753, 132754, 132755, 52563, 132755, 132756, 132759, 132762, 126757, 132754, 132755, 132756, 189269, 265886, 43155, 80519, 159667, 82479, 152760, 136747, 65216, 114684, 114686, 114687, 117132, 132667, 81755, 109198, 109199, 109202, 144124, 75023, 216449, 139527, 149146, 152038, 177428, 177430, 177428, 177430, 56930, 61658, 207112, 250755, 250756, 56930, 141148, 191036, 147084, 245110, 175959, 177376, 177377, 88463, 103641, 115111, 165118, 228427, 43553, 76706, 78278, 131353, 75725, 119146, 12615, 15740, 229533, 151325, 237568, 239545, 239546, 239547, 110872, 215593, 60310, 60312, 133398, 44502, 261582, 261590, 261599, 271593, 261584, 261588, 261649, 261653, 261654, 261658, 261662, 105814)
ipGroupFrom <- factor(c('DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'Video', 'Video', 'Video', 'Video', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD', 'DVD'), levels=c("DVD", "Video"))
ipSRFrom <- c(30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 37, 37, 37, 37, 37, 37, 26, 26, 26, 26, 14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16, 10, 10, 10, 10, 1, 1, 1, 1, 1, 1, 1, 1, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 17, 17, 17, 17, 17, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 27, 27, 27, 27, 27, 27, 27, 10, 10, 10, 10, 10, 10, 6, 6, 6, 6, 6, 6, 19, 19, 19, 19, 19, 19, 19, 25, 25, 25, 25, 25, 3, 3, 3, 3, 3, 8, 8, 8, 27, 27, 27, 27, 27, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 26, 26, 26, 26, 26, 15, 15, 15, 15, 8, 8, 8, 8, 8, 8, 26, 26, 26, 26, 26, 26)
ipSRTo <- c(5, 2, 18, 20, 12, 6, 8, 14, 16, 4, 18, 20, 3, 6, 14, 5, 3, 3, 4, 3, 13, 3, 5, 9, 18, 17, 8, 2, 8, 9, 16, 9, 24, 11, 25, 6, 9, 3, 21, 1, 5, 2, 24, 2, 6, 6, 8, 18, 7, 4, 20, 6, 22, 13, 10, 19, 4, 22, 7, 7, 9, 7, 11, 21, 12, 17, 21, 5, 7, 2, 1, 26, 6, 14, 2, 17, 4, 13, 12, 6, 8, 13, 4, 7, 1, 7, 9, 15, 19, 6, 20, 0, 19, 14, 18, 11, 14, 18, 11, 14, 14, 18, 11, 16, 1, 5, 3, 5, 6, 22, 5, 20, 10, 29, 9, 22, 9, 12, 10, 9, 12, 29, 9, 12, 13, 6, 23, 6, 18, 10, 18, 6, 9, 11, 8, 8, 19, 12, 10, 9, 8, 14, 1, 7, 10, 13, 18, 6, 6, 4, 6, 4, 4, 22, 5, 8, 4, 4, 13, 11, 3, 4, 21, 22, 8, 18, 1, 6, 5, 5, 4, 8, 6, 12, 6, 3, 13, 8, 10, 1, 1, 22, 12, 18, 19, 5, 18, 31, 8, 13, 10, 14, 25, 4, 19, 17, 5, 21, 3, 1, 19, 10)
ipTRFrom <- c(290, 290, 290, 290, 290, 290, 290, 290, 290, 290, 290, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 73, 73, 73, 73, 73, 73, 73, 294, 294, 294, 294, 294, 294, 43, 43, 43, 43, 5, 5, 5, 5, 5, 5, 5, 5, 13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 1, 1, 1, 1, 1, 1, 1, 1, 110, 110, 110, 110, 110, 110, 7, 7, 7, 7, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 25, 25, 25, 25, 25, 25, 25, 2, 2, 2, 2, 2, 2, 12, 12, 12, 12, 12, 12, 111, 111, 111, 111, 111, 111, 111, 294, 294, 294, 294, 294, 0, 0, 0, 0, 0, 0, 0, 0, 243, 243, 243, 243, 243, 43, 43, 43, 43, 43, 43, 15, 15, 15, 15, 15, 483, 483, 483, 483, 483, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 2, 2, 2, 2, 2, 2)
ipTRTo <- c(19, 2, 22, 105, 22, 1, 6, 55, 40, 21, 47, 13, 0, 42, 14, 51, 2, 4, 0, 2, 41, 4, 0, 19, 21, 63, 5, 0, 2, 4, 63, 63, 7, 1, 8, 11, 134, 134, 12, 5, 10, 3, 58, 1, 6, 2, 27, 39, 2, 18, 87, 12, 218, 2, 30, 17, 0, 41, 13, 9, 3, 2, 13, 8, 10, 1, 8, 1, 0, 7, 1, 167, 63, 28, 0, 6, 1, 10, 4, 0, 2, 0, 5, 2, 3, 2, 2, 12, 24, 45, 21, 0, 8, 2, 21, 20, 2, 21, 20, 2, 2, 21, 20, 14, 6, 6, 3, 19, 13, 88, 4, 9, 6, 0, 19, 54, 19, 6, 9, 1, 2, 0, 19, 6, 3, 13, 46, 29, 6, 1, 15, 1, 4, 18, 28, 5, 15, 21, 10, 12, 3, 5, 4, 3, 8, 5, 0, 0, 5, 0, 5, 0, 1, 221, 1, 13, 3, 1, 7, 40, 5, 0, 8, 37, 67, 48, 0, 6, 1, 25, 1, 69, 0, 55, 3, 0, 5, 5, 2, 13, 0, 44, 53, 9, 4, 5, 13, 212, 3, 3, 1, 3, 8, 0, 3, 12, 11, 10, 5, 0, 49, 42)
ipTitleFrom <- c(16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 13, 13, 13, 13, 13, 13, 13, 30, 30, 30, 30, 30, 30, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26, 5, 5, 5, 5, 5, 5, 5, 5, 23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 22, 22, 18, 18, 18, 18, 18, 18, 25, 25, 25, 25, 14, 14, 14, 14, 14, 12, 12, 12, 12, 12, 12, 21, 21, 21, 21, 21, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 27, 27, 27, 27, 27, 27, 27, 27, 27, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 29, 29, 29, 29, 29, 29, 29, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 7, 7, 7, 7, 7, 6, 6, 6, 9, 9, 9, 9, 9, 17, 17, 17, 17, 17, 17, 10, 10, 10, 10, 10, 24, 24, 24, 24, 24, 19, 19, 19, 19, 3, 3, 3, 3, 3, 3, 28, 28, 28, 28, 28, 28)
ipNames <- c('Attraction', 'Barbara The Fair With The Silken Hair', 'Cannibal Apocalypse', "DJ Qbert's Wave Twisters", 'David and Lisa', 'Def Comedy Jam  Vol. 13', 'Detroit Lions 2001 NFL Team Video', 'Donnie McClurkin: Live in London and More', 'El Hombre Sin Sombra (Hollow Man)', 'Gladiator', 'Kindergarten Cop', "Kingsley's Meadow - Wise Guy", "Lady & The Tramp II - Scamp's Adventure", 'Lojong - Transforming the Mind (Boxed Set)', 'Menace II Society', 'Merlin', 'Modern Times', 'Murder by Numbers (Full Screen Edition)', 'Nancy Drew: A Haunting We Will Go', 'Princess Nine - Triple Play (Vol. 3)', 'Secret Agent AKA Danger Man  Set 2', 'Seguire Tus Pasos', 'Selena Remembered', 'Seven (New Line Platinum Series)', 'Sheba  Baby', 'Slaughter', 'The Complete Guide to Medicine Ball Training', 'The Gambler', 'The Getaway', 'The Sum of All Fears')
ip_df <- data.frame(X=1:202, 
                    from=ipFrom, 
                    to=ipTo, 
                    salesrank.from=ipSRFrom, 
                    salesrank.to=ipSRTo, 
                    totalreviews.from=ipTRFrom, 
                    totalreviews.to=ipTRTo, 
                    group.from=ipGroupFrom, 
                    title.from=factor(ipNames[ipTitleFrom], levels=ipNames)
                    )


# Create a new graph
ip_g <- graph_from_data_frame(ip_df %>% select(from, to), directed = TRUE)

# Add color to the edges based on sales rank: blue is higher to lower, red is lower to higher
E(ip_g)$rank_flag <- ifelse(ip_df$salesrank.from <= ip_df$salesrank.to, "blue", "red")

# Plot and add a legend
plot(ip_g, vertex.label = NA, edge.arrow.width = 1, edge.arrow.size = 0, 
    edge.width = 4, margin = 0, vertex.size = 4, 
    edge.color = E(ip_g)$rank_flag, vertex.color = "black" )
legend("bottomleft", legend = c("Lower to Higher Rank", "Higher to Lower Rank"), 
       fill = unique(E(ip_g)$rank_flag ), cex = .7)

# Get a count of out degrees for all vertices
# deg_ct <- lapply(time_graph, function(x){return(degree(x, mode = "out") )})

# Create a dataframe starting by adding the degree count
# deg_df <- data.frame(ct = unlist(deg_ct))

# Add a column with the vertex names 
# deg_df$vertex_name <- names(unlist(deg_ct))

# Add a time stamp 
# deg_df$date <- ymd(rep(d, unlist(lapply(time_graph, function(x){length(V(x))}))))

# See all the vertices that have more than three out degrees
# lapply(time_graph, function(x){return(V(x)[degree(x, mode = "out") > 3])})

# Create a dataframe to plot of three important vertices
# vert_df <- deg_df %>% filter(vertex_name %in% c(1629, 132757, 117841))

# Draw the plot to see how they change through time
# ggplot(vert_df, aes(x = date, y = ct, group = vertex_name, colour = vertex_name)) + geom_path()


# Calculate clustering and reciprocity metrics
# trans <- unlist(lapply(all_graphs, FUN=transitivity))
# rp <- unlist(lapply(all_graphs, FUN=reciprocity))

# Create dataframe for plotting
# met_df <- data.frame("metric" = c(trans, rp))

# Repeat the data
# met_df$date <- rep(ymd(d), 2)

# Sort and then Repeat the metric labels
# met_df$name <- sort(rep(c("clustering", "reciprocity"), 4))

# Plot
# ggplot(met_df, aes(x= date, y= metric, group = name, colour = name)) + geom_path()

Chapter 2 - Talk About R on Twitter

Creating retweet graphs:

  • Data are several days of tweets from #rstats - want to use retweets (tweets that start with RT) to form the network
    • raw_tweets <- read.csv("datasets/rstatstweets.csv", stringsAsFactors = F)
    • all_sn <- unique(raw_tweets$screen_name)
    • rt_g <- graph.empty()
    • rt_g <- rt_g + vertices(all_sn)
    • for(i in 1:dim(raw_tweets)[1]){
    • rt_name <- find_rt(raw_tweets$tweet_text[i])
    • if(!is.null(rt_name)){
    •   if(!rt_name %in% all_sn){   
    •       rt_g <- rt_g + vertices(rt_name)  
    •   }  
    • rt_g <- rt_g + edges(c(raw_tweets$screen_name[i], rt_name))
    • }
    • }
    • sum(degree(rt_g) == 0)
    • rt_g <- simplify(rt_g)
    • rt_g <- delete.vertices(rt_g, degree(rt_g) == 0)
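The find_rt() helper used in the loop above is not defined in the course excerpt. A minimal base-R sketch, assuming the classic convention that a retweet begins with "RT @screen_name", might look like:

```r
# A sketch of the find_rt() helper referenced above (implementation assumed;
# the course does not show it). Returns the retweeted screen name when a
# tweet starts with "RT @screen_name", otherwise NULL.
find_rt <- function(tweet_text) {
  m <- regmatches(tweet_text, regexec("^RT @([A-Za-z0-9_]+)", tweet_text))[[1]]
  if (length(m) < 2) return(NULL)
  m[2]
}

find_rt("RT @hadleywickham: ggplot2 update")  # "hadleywickham"
find_rt("No retweet here #rstats")            # NULL
```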

Building mentions graphs:

  • Tweets that mention someone can be a reply or a callout
  • There is more complexity than with the retweets, since there is no common format to a mention such as “starts with RT”
    • ment_g <- graph.empty()
    • ment_g <- ment_g + vertices(all_sn)
    • for(i in 1:dim(raw_tweets)[1]) {
    • ment_name <- mention_ext(raw_tweets$tweet_text[i])
    • if(length(ment_name) > 0 ) {
    •   for(j in ment_name) {  
    •       if(!j %in% all_sn) {  
    •           ment_g <- ment_g + vertices(j)  
    •       }  
    •   ment_g <- ment_g + edges(c(raw_tweets$screen_name[i], j))  
    •   }  
    • }
    • }
  • The mentions graph is significantly different, with many more small conversations appearing as separate sub-graphs
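The mention_ext() helper is likewise undefined in the excerpt. A sketch under the assumption that a mention is any @screen_name token anywhere in the tweet body:

```r
# A sketch of the mention_ext() helper referenced above (implementation
# assumed). Returns every screen name mentioned in a tweet; unlike retweets,
# mentions can appear anywhere, so the whole string is scanned.
mention_ext <- function(tweet_text) {
  m <- regmatches(tweet_text, gregexpr("@([A-Za-z0-9_]+)", tweet_text))[[1]]
  sub("^@", "", m)  # strip the leading "@"; character(0) when no mentions
}

mention_ext("thanks @drob and @juliasilge!")  # "drob" "juliasilge"
```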

Finding communities:

  • Communities are a natural way to think about graphs - groups of people who talk much more to each other than to the rest of the network
    • ment_edg <- cluster_edge_betweenness(as.undirected(ment_g))
    • ment_eigen <- cluster_leading_eigen(as.undirected(ment_g))
    • ment_lp <- cluster_label_prop(as.undirected(ment_g))
    • length(ment_edg)
    • table(sizes(ment_edg))
  • Can compare similarities within community structures
    • compare(ment_edg, ment_eigen, method = 'vi') # "vi" is "variation of information"
    • compare(ment_eigen, ment_lp, method = 'vi')
    • compare(ment_lp, ment_edg, method = 'vi')
  • Can also plot the community structures
    • lrg_eigen <- as.numeric(names(ment_eigen[which(sizes(ment_eigen) >45)]))
    • eigen_sg <- induced.subgraph(ment_g, V(ment_g)[ eigen %in% lrg_eigen])
  • plot(eigen_sg, vertex.label = NA, edge.arrow.width = .8, edge.arrow.size = 0.2,
  • coords = layout_with_fr(eigen_sg), margin = 0, vertex.size = 6,
  • vertex.color = as.numeric(as.factor(V(eigen_sg)$eigen))
  • )

Example code includes:

rt_g <- read.graph("./RInputFiles/rt_g.gml", format=c("gml"))
rt_g
## IGRAPH 096a106 DN-- 4118 6052 -- 
## + attr: id (v/n), name (v/c)
## + edges from 096a106 (vertex names):
##  [1] thinkR_fr      ->thw_ch          thinkR_fr      ->omarwagih      
##  [3] KJMillidine    ->Rbloggers       earino         ->d4tagirl       
##  [5] ReecheshJC     ->KirkDBorne      SCMansbridge   ->rstudiotips    
##  [7] DeepSingularity->gp_pulipaka     chrisderv      ->thinkR_fr      
##  [9] chrisderv      ->_ColinFay       chrisderv      ->joshua_ulrich  
## [11] mtrost2        ->romain_francois mtrost2        ->rstudiotips    
## [13] mtrost2        ->RLangTip        dani_sola      ->rstudiotips    
## [15] hrhotz         ->rstudiotips     hrhotz         ->cboettig       
## + ... omitted several edges
# Calculate the number of edges
gsize(rt_g)
## [1] 6052
# Calculate the number of nodes
gorder(rt_g)
## [1] 4118
# Calculate the density
graph.density(rt_g)
## [1] 0.00035697
# Create the plot
plot(rt_g, vertex.label = NA, edge.arrow.width = .8, edge.arrow.size = 0.4, vertex.size = 3)

# Set the default color to black
V(rt_g)$color <- "black"

# Set the color of nodes that were retweeted just once to blue
V(rt_g)[degree(rt_g, mode = "in") == 1]$color <- "blue"

# Set the color of nodes that retweeted others just once to green 
V(rt_g)[degree(rt_g, mode = "out") == 1 ]$color <- "green"

# Plot the network
plot(rt_g, vertex.label = NA, edge.arrow.width = .8, 
    edge.arrow.size = 0.25, 
    vertex.size = 4, vertex.color = V(rt_g)$color)

# Set the default color to black
V(rt_g)$color <- "black"

# Set the color of nodes that were retweeted just once to blue
V(rt_g)[degree(rt_g, mode = "in") == 1 & degree(rt_g, mode = "out") == 0]$color <- "blue"

# Set the color of nodes that retweeted others just once to green 
V(rt_g)[degree(rt_g, mode = "in") == 0 & degree(rt_g, mode = "out") == 1 ]$color <- "green"

# Plot the network
plot(rt_g, vertex.label = NA, edge.arrow.width = .8, 
    edge.arrow.size = 0.25, 
    vertex.size = 4, vertex.color = V(rt_g)$color)

# Calculate betweenness
rt_btw <- igraph::betweenness(rt_g, directed = TRUE)

# Plot histogram
hist(rt_btw, breaks = 2000, xlim = c(0, 1000), main = "Betweenness")

# Calculate eigen centrality
rt_ec <- eigen_centrality(rt_g, directed = TRUE)

# Plot histogram
hist(rt_ec$vector, breaks = 100, xlim = c(0, .2), main = "Eigen Centrality")

# Get top 1% of vertices by eigen centrality
top_ec <- rt_ec$vector[rt_ec$vector > quantile(rt_ec$vector, .99)]

# Get top 1% of vertices by betweenness
top_btw <- rt_btw[rt_btw > quantile(rt_btw, .99)]

# Make a nice data frame to print, with three columns, Rank, Betweenness, and Eigencentrality
most_central <- as.data.frame(cbind(1:length(top_ec), names(sort(top_btw, decreasing = T)), 
                                    names(sort(top_ec, decreasing = T))
                                    )
                              )

# Set column names
colnames(most_central) <- c("Rank", "Betweenness", "Eigen Centrality")

# Print out the data frame
print(most_central)
##    Rank    Betweenness Eigen Centrality
## 1     1  hadleywickham        ma_salmon
## 2     2        kierisi      rstudiotips
## 3     3           drob          opencpu
## 4     4        opencpu     AchimZeileis
## 5     5      ma_salmon        dataandme
## 6     6       rmflight             drob
## 7     7      dataandme        _ColinFay
## 8     8      _ColinFay         rOpenSci
## 9     9     juliasilge        kearneymw
## 10   10      revodavid    RobertMylesMc
## 11   11       rOpenSci        ptrckprry
## 12   12     nj_tierney         rmflight
## 13   13     jonmcalder      thosjleeper
## 14   14      Md_Harris        revodavid
## 15   15   mauro_lepore       juliasilge
## 16   16       sckottie    RLadiesGlobal
## 17   17  RLadiesGlobal    hadleywickham
## 18   18      kearneymw     mauro_lepore
## 19   19      lenkiefer       JennyBryan
## 20   20       NumFOCUS         tudosgar
## 21   21         tjmahr         cboettig
## 22   22  TheRealEveret         antuki13
## 23   23     RLadiesMAD         jasdumas
## 24   24       jasdumas        Rbloggers
## 25   25     JennyBryan         rensa_co
## 26   26       hrbrmstr         timtrice
## 27   27       antuki13         daattali
## 28   28        Voovarb         johnlray
## 29   29       timtrice       joranelias
## 30   30      thinkR_fr     StatsbyLopez
## 31   31     benmarwick          kierisi
## 32   32  RosanaFerrero    joshua_ulrich
## 33   33     clquezadar        thinkR_fr
## 34   34       drsimonj           ledell
## 35   35        zentree           pssGuy
## 36   36      thomasp85     bastistician
## 37   37       OilGains          zentree
## 38   38    yodacomplex   brookLYNevery1
## 39   39  annakrystalli        Md_Harris
## 40   40 davidhughjones         sckottie
## 41   41       noamross       jonmcalder
## 42   42       AlexaLFH       nj_tierney
# Transform rt_btw and add as centrality
V(rt_g)$cent <-  log(rt_btw+2)

# Visualize
plot(rt_g, vertex.label = NA, edge.arrow.width = .2,
     edge.arrow.size = 0.0,
     vertex.size = unlist(V(rt_g)$cent), vertex.color = "red")

# Create subgraph 
rt_sub <-induced_subgraph(rt_g, V(rt_g)[V(rt_g)$cent >= quantile(V(rt_g)$cent, 0.99 )])

# Plot subgraph
plot(rt_sub, vertex.label = NA, edge.arrow.width = .2,
     edge.arrow.size = 0.0,
     vertex.size = unlist(V(rt_sub)$cent), vertex.color = "red")

ment_g <- read.graph("./RInputFiles/ment_g.gml", format=c("gml"))
ment_g
## IGRAPH 1c93213 DN-- 955 975 -- 
## + attr: id (v/n), name (v/c)
## + edges from 1c93213 (vertex names):
##  [1] thinkR_fr  ->ma_salmon       thinkR_fr  ->rstudio        
##  [3] thinkR_fr  ->rforjournalists thinkR_fr  ->aschinchon     
##  [5] thinkR_fr  ->zedsamurai      thinkR_fr  ->ikashnitsky    
##  [7] thinkR_fr  ->NSSDeviations   thinkR_fr  ->BeginTry       
##  [9] chrisderv  ->pbaumgartner    njogukennly->rstudio        
## [11] ma_salmon  ->rOpenSci        ma_salmon  ->RLadiesDC      
## [13] ma_salmon  ->marvin_dpr      ma_salmon  ->drob           
## [15] ma_salmon  ->kearneymw       ma_salmon  ->LucyStats      
## + ... omitted several edges
rt_ratio <- degree(rt_g, mode="in") / (degree(rt_g, mode="out"))
ment_ratio <- degree(ment_g, mode="in") / (degree(ment_g, mode="out"))

# Create a dataframe to plot with ggplot
ratio_df <- data.frame(io_ratio = c(ment_ratio, rt_ratio))
ratio_df["graph_type"] <- c(rep("Mention", length(ment_ratio)), rep("Retweet", length(rt_ratio)) )
ratio_df_filtered <- ratio_df %>% filter(!is.infinite(io_ratio) & io_ratio > 0)

# Plot the graph
ggplot(ratio_df, aes(x = io_ratio , fill= graph_type, group = graph_type)) +
  geom_density(alpha = .5) +
  xlim(0, 10)
## Warning: Removed 891 rows containing non-finite values (stat_density).

# Check the mean and median of each ratio
ratio_df %>% group_by(graph_type) %>% summarise(m_ratio = mean(io_ratio))
## # A tibble: 2 x 2
##   graph_type m_ratio
##   <chr>        <dbl>
## 1 Mention        Inf
## 2 Retweet        Inf
ratio_df %>% group_by(graph_type) %>% summarise(med = median(io_ratio))
## # A tibble: 2 x 2
##   graph_type   med
##   <chr>      <dbl>
## 1 Mention      Inf
## 2 Retweet        0
ratio_df %>% filter(io_ratio != +Inf) %>% group_by(graph_type) %>% summarise(m_ratio = mean(io_ratio))
## # A tibble: 2 x 2
##   graph_type m_ratio
##   <chr>        <dbl>
## 1 Mention      0.294
## 2 Retweet      0.268
ratio_df %>% filter(io_ratio != +Inf) %>% group_by(graph_type) %>% summarise(med = median(io_ratio))
## # A tibble: 2 x 2
##   graph_type   med
##   <chr>      <dbl>
## 1 Mention        0
## 2 Retweet        0
# Plot mention graph 
plot(ment_g, vertex.label = NA, edge.arrow.width = .8,
     edge.arrow.size = 0.2,
     margin = 0,
     vertex.size = 3)

# Find the assortativity of each graph
assortativity_degree(rt_g, directed = TRUE)
## [1] -0.1502212
assortativity_degree(ment_g, directed = TRUE)
## [1] -0.07742748
# Find the reciprocity of each graph
reciprocity(rt_g) 
## [1] 0.005948447
reciprocity(ment_g)
## [1] 0.01846154
# Get size 3 cliques
clq_list <- cliques(ment_g, min = 3, max = 3)
## Warning in cliques(ment_g, min = 3, max = 3): At igraph_cliquer.c:56 :Edge
## directions are ignored for clique calculations
# Convert to a dataframe and filter down to just revodavid cliques
clq_df <- data.frame(matrix(names(unlist(clq_list)), nrow = length(clq_list), byrow = T))
rev_d <- clq_df %>% filter(X1 == "revodavid" | X2 == "revodavid" | X3 == "revodavid") %>% droplevels()

# Create empty graph and build it up
clq_g_empty <- graph.empty()
clq_g <- clq_g_empty + vertices(unique(unlist(rev_d)))
for(i in 1:dim(rev_d)[1]){
  clq_g <- clq_g + edges(rev_d[i, 1], rev_d[i, 2])
  clq_g <- clq_g + edges(rev_d[i, 2], rev_d[i, 3])
  clq_g <- clq_g + edges(rev_d[i, 1], rev_d[i, 3])}

# Trim graph and plot using `simplify()`
clq_g_trimmed <- as.undirected(simplify(clq_g))
plot(clq_g_trimmed)

# Find the communities
rt_fgc <-  cluster_fast_greedy(as.undirected(rt_g))
rt_info <- cluster_infomap(as.undirected(rt_g))
rt_clust <- cluster_louvain(as.undirected(rt_g))

# Compare all the communities
compare(rt_fgc, rt_clust, method = 'vi')
## [1] 2.144703
compare(rt_info, rt_clust, method = 'vi')
## [1] 1.623552
compare(rt_fgc, rt_info, method = 'vi')
## [1] 2.324274
# Test membership of the same users
fgc_test <- which(names(membership(rt_fgc)) %in% c("bass_analytics", "big_data_flow"))
membership(rt_fgc)[fgc_test]
## bass_analytics  big_data_flow 
##              3              3
info_test <- which(names(membership(rt_info)) %in% c("bass_analytics", "big_data_flow"))
membership(rt_info)[info_test]
## bass_analytics  big_data_flow 
##            102             77
# The crossing() function in igraph will return true if a particular edge crosses communities
# This is useful when we want to see certain vertices that are bridges between communities

# Assign cluster membership to each vertex in rt_g using membership()
V(rt_g)$clust <- membership(rt_clust)

# Assign crossing value to each edge
E(rt_g)$cross <- crossing(rt_clust, rt_g)

# Plot the whole graph (this is probably a mess)
plot(rt_g, vertex.label = NA, edge.arrow.width = .8, edge.arrow.size = 0.2, 
     coords = layout_with_fr(rt_g), margin = 0, vertex.size = 3, 
     vertex.color = V(rt_g)$clust, edge.color = E(rt_g)$cross+1)

# Create a subgraph with just a few communities greater than 50 but less than 90 in size
mid_comm <- as.numeric(names(sizes(rt_clust)[sizes(rt_clust) > 50 & sizes(rt_clust) < 90 ]))
rt_sg <- induced.subgraph(rt_g, V(rt_g)[ clust %in% mid_comm ])

# Plot the subgraph
plot(rt_sg, vertex.label = NA, edge.arrow.width = .8, edge.arrow.size = 0.2, 
     coords = layout_with_fr(rt_sg), margin = 0, vertex.size = 3, 
     vertex.color = V(rt_sg)$clust, edge.color = E(rt_sg)$cross+1)


Chapter 3 - Bike Sharing in Chicago

Creating our graph from raw data:

  • Dataset based on Chicago Divvy bike sharing, from freely available data
    • bike_dat <- read.csv("/Users/edmundhart/wkspace/courses-case-studies-network-r/datasets/bike2_test3.csv", stringsAsFactors = F)
    • trip_df <- bike_dat %>% group_by(from_station_id, to_station_id) %>% summarise(weights = n())
    • head(trip_df)
    • trip_g <- graph_from_data_frame(trip_df[, 1:2])
    • E(trip_g)$weight <- trip_df$weights
    • gsize(trip_g)
    • gorder(trip_g)
  • Can create a sub-graph and run some initial explorations on the network - will notice that there are many loops (trips where to/from is the same)
    • sg <- induced_subgraph(trip_g, 1:12)
    • plot(sg, vertex.label = NA, edge.arrow.width = .8, edge.arrow.size = 0.6, margin = 0, vertex.size = 6, edge.width = log(E(sg)$weight+2))

Compare Graph Distance vs. Geographic Distance:

  • Graphs do not always reflect the geography well; graph distance may or may not be related to geographic distance
  • Can get graph distances using built in functions
    • farthest_vertices(trip_g_simp)
    • get_diameter(trip_g_simp)
  • Can use geographic coding (lat/lon) to find the geographic distances
    • library(geosphere)
    • st_to <- bike_dat %>% filter(from_station_id == 336 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
    • st_from <- bike_dat %>% filter(from_station_id == 340 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
    • farthest_dist <- distm(st_from, st_to, fun = distHaversine)
    • bike_dist <- function(station_1, station_2, divy_bike_df){
    • st1 <- divy_bike_df %>% filter(from_station_id == station_1 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
    • st2 <- divy_bike_df %>% filter(from_station_id == station_2 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
    • farthest_dist <- distm(st1, st2, fun = distHaversine)
    • return(farthest_dist)
    • }
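geosphere::distHaversine() computes a great-circle distance on a spherical Earth. As a cross-check on the calls above, the haversine formula itself takes only a few lines of base R (6378137 m is geosphere's default Earth radius):

```r
# Haversine (great-circle) distance in meters between two lon/lat points.
# A base-R sketch of what geosphere::distHaversine computes, assuming a
# spherical Earth of radius 6378137 m (geosphere's default).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# One degree of latitude is roughly 111 km
haversine_m(-87.63, 41.88, -87.63, 42.88)  # ~111319 m
```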

Connectivity:

  • Can be measured for either vertices or edges - the minimum number that must be removed to split the graph into two disconnected components
    • rand_g <- erdos.renyi.game(10, .4, "gnp", directed = F)
    • plot(rand_g)
    • vertex_connectivity(rand_g)
    • edge_connectivity(rand_g)
    • min_cut(rand_g, value.only = F) # more information about the connectivity
  • Can then run comparisons between random graphs and bike-sharing graphs
    • nv <- gorder(trip_g_ud)
    • ed <- edge_density(trip_g_ud)
    • graph_vec <- rep(NA, 1000)
    • for(i in 1:1000){ w1 <- erdos.renyi.game(nv, ed, "gnp", directed = T) ; graph_vec[i] <- edge_connectivity(w1) }
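The connectivity measures above are easiest to see on a tiny graph where the answer is visible by eye; a sketch assuming the igraph package is loaded:

```r
library(igraph)

# Two triangles joined by a single bridge edge C-D: removing that one edge
# (or either of its endpoints) splits the graph, so both measures are 1.
g <- graph_from_literal(A - B, B - C, C - A, C - D, D - E, E - F, F - D)

edge_connectivity(g)                # 1
vertex_connectivity(g)              # 1
min_cut(g, value.only = FALSE)$cut  # identifies the bridge edge itself
```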

Example code includes:

bike_dat <- readr::read_csv("./RInputFiles/divvy_bike_sample.csv")
## Parsed with column specification:
## cols(
##   tripduration = col_double(),
##   from_station_id = col_double(),
##   from_station_name = col_character(),
##   to_station_id = col_double(),
##   to_station_name = col_character(),
##   usertype = col_character(),
##   gender = col_character(),
##   birthyear = col_double(),
##   from_latitude = col_double(),
##   from_longitude = col_double(),
##   to_latitude = col_double(),
##   to_longitude = col_double(),
##   geo_distance = col_double()
## )
glimpse(bike_dat)
## Observations: 52,800
## Variables: 13
## $ tripduration      <dbl> 295, 533, 1570, 2064, 2257, 296, 412, 948, 8...
## $ from_station_id   <dbl> 49, 165, 25, 300, 85, 174, 75, 45, 85, 99, 3...
## $ from_station_name <chr> "Dearborn St & Monroe St", "Clark St & Wavel...
## $ to_station_id     <dbl> 174, 308, 287, 296, 313, 198, 56, 147, 174, ...
## $ to_station_name   <chr> "Canal St & Madison St", "Seeley Ave & Rosco...
## $ usertype          <chr> "Subscriber", "Subscriber", "Customer", "Cus...
## $ gender            <chr> "Male", "Male", NA, NA, "Male", "Female", "M...
## $ birthyear         <dbl> 1964, 1972, NA, NA, 1963, 1973, 1989, 1965, ...
## $ from_latitude     <dbl> 41.88132, 41.95078, 41.89766, 41.93773, 41.9...
## $ from_longitude    <dbl> -87.62952, -87.65917, -87.62351, -87.64409, ...
## $ to_latitude       <dbl> 41.88209, 41.94340, 41.88032, 41.94011, 41.9...
## $ to_longitude      <dbl> -87.63983, -87.67962, -87.63519, -87.64545, ...
## $ geo_distance      <dbl> 858.9672, 1881.5034, 2159.4804, 287.8546, 30...
# Create trip_df_subs
trip_df_subs <- bike_dat %>% 
  filter(usertype == "Subscriber") %>% 
  group_by(from_station_id, to_station_id) %>% 
  summarise(weights = n())

# Create igraph object
trip_g_subs <- graph_from_data_frame(trip_df_subs[, 1:2])

# Add edge weights
E(trip_g_subs)$weights <- trip_df_subs$weights / sum(trip_df_subs$weights)

# Now apply the same steps, filtered down to non-subscribers
trip_df_non_subs <- bike_dat %>% 
  filter(usertype == "Customer") %>% 
  group_by(from_station_id, to_station_id) %>% 
  summarise(weights = n())

# Create igraph object
trip_g_non_subs <- graph_from_data_frame(trip_df_non_subs[, 1:2])

# Add edge weights
E(trip_g_non_subs)$weights <- trip_df_non_subs$weights / sum(trip_df_non_subs$weights)

# Now let's compare these graphs
gsize(trip_g_subs)
## [1] 14679
gsize(trip_g_non_subs)
## [1] 9528
# Create the subgraphs
sg_sub <- induced_subgraph(trip_g_subs, 1:12)
sg_non_sub <- induced_subgraph(trip_g_non_subs, 1:12)

# Plot sg_sub
plot(sg_sub, vertex.size = 20, edge.arrow.width = .8, edge.arrow.size = 0.4, 
     margin = 0, edge.width = E(sg_sub)$weights*10000, main = "Subscribers")

# Plot sg_non_sub
plot(sg_non_sub, vertex.size = 20, edge.arrow.width = .8, edge.arrow.size = 0.4, 
     margin = 0, edge.width = E(sg_non_sub)$weights*10000, 
     main = "Customers")

bike_dist <- function(station_1, station_2, divy_bike_df){ 
   st1 <- divy_bike_df %>% filter(from_station_id == station_1 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
   st2 <- divy_bike_df %>% filter(from_station_id == station_2 ) %>% sample_n(1) %>% select(from_longitude, from_latitude)
   farthest_dist <- geosphere::distm(st1, st2, fun = geosphere::distHaversine)
   return(farthest_dist)
}


# See the diameter of each graph
get_diameter(trip_g_subs)
## + 7/300 vertices, named, from 54f938b:
## [1] 200 336 267 150 45  31  298
get_diameter(trip_g_non_subs)
## + 6/299 vertices, named, from 55071b9:
## [1] 116 31  25  137 135 281
# Find the farthest vertices
farthest_vertices(trip_g_subs)
## $vertices
## + 2/300 vertices, named, from 54f938b:
## [1] 200 298
## 
## $distance
## [1] 6
farthest_vertices(trip_g_non_subs)
## $vertices
## + 2/299 vertices, named, from 55071b9:
## [1] 116 281
## 
## $distance
## [1] 5
# See how far apart each one is and compare the distances
bike_dist(200, 298, bike_dat)
##          [,1]
## [1,] 17078.31
bike_dist(116, 281, bike_dat)
##          [,1]
## [1,] 7465.656
# Create trip_df
trip_df <- bike_dat %>% 
  group_by(from_station_id, to_station_id) %>% 
  summarise(weights = n())

# Create igraph object
trip_g_df <- graph_from_data_frame(trip_df[, 1:2])

# Add edge weights
E(trip_g_df)$weights <- trip_df$weights / sum(trip_df$weights)


trip_g_simp <- simplify(trip_g_df, remove.multiple=FALSE)
trip_g_simp
## IGRAPH 553a886 DN-- 300 18773 -- 
## + attr: name (v/c), weights (e/n)
## + edges from 553a886 (vertex names):
##  [1] 5->14  5->16  5->25  5->29  5->33  5->35  5->36  5->37  5->43  5->49 
## [11] 5->51  5->52  5->53  5->55  5->59  5->66  5->68  5->72  5->74  5->75 
## [21] 5->76  5->81  5->85  5->90  5->92  5->97  5->98  5->99  5->100 5->108
## [31] 5->110 5->111 5->117 5->120 5->128 5->134 5->135 5->137 5->140 5->141
## [41] 5->144 5->146 5->148 5->149 5->168 5->169 5->171 5->174 5->175 5->176
## [51] 5->177 5->178 5->181 5->191 5->192 5->193 5->194 5->198 5->210 5->214
## [61] 5->218 5->227 5->233 5->237 5->255 5->264 5->268 5->273 5->277 5->291
## [71] 5->309 5->321 5->333 5->335 5->341
## + ... omitted several edges
# Find the degree distribution
trip_out <- degree(trip_g_simp, mode = "out")
trip_in <- degree(trip_g_simp, mode = "in")

# Create a data frame for easier filtering 
trip_deg <- data.frame(cbind(trip_out, trip_in))
trip_deg$station_id <- names(trip_out)
trip_deg_adj <- trip_deg %>% mutate(ratio = trip_out / trip_in)

# Filter out rarely traveled to stations
trip_deg_filter <- trip_deg_adj %>% filter(trip_out > 10) %>% filter(trip_in > 10) 

# Plot histogram
hist(trip_deg_filter$ratio)

# See which stations were the most skewed using which.min() and which.max()
trip_deg_filter %>% slice(which.min(ratio))
##   trip_out trip_in station_id     ratio
## 1       14      24        207 0.5833333
trip_deg_filter %>% slice(which.max(ratio))
##   trip_out trip_in station_id    ratio
## 1       19      11        135 1.727273
# If the weights are the same across all stations, then an unweighted degree ratio would work
# But if we want to know how many bikes are actually flowing, we need to consider weights
# The weighted analog to degree distribution is strength
# We can calculate this with the strength() function, which presents a weighted degree distribution based on the weight attribute of a graph's edges

# Calculate the weighted in and out degrees
trip_out_w <- strength(trip_g_simp, mode = "out")
trip_in_w <- strength(trip_g_simp, mode = "in")

# Create a data frame for easier filtering 
trip_deg_w <- data.frame(cbind(trip_out_w, trip_in_w))
trip_deg_w$station_id <- names(trip_out_w)
trip_deg_w_adj <- trip_deg_w %>% mutate(ratio = trip_out_w / trip_in_w)

# Filter out rarely traveled to stations
trip_deg_w_filter <- trip_deg_w_adj %>% filter(trip_out_w > 10) %>% filter(trip_in_w > 10) 

# Plot histogram of ratio
hist(trip_deg_w_filter$ratio)

# See which stations were the most skewed using which.min() and which.max()
trip_deg_w_filter %>% slice(which.min(ratio))
##   trip_out_w trip_in_w station_id     ratio
## 1         14        24        207 0.5833333
trip_deg_w_filter %>% slice(which.max(ratio))
##   trip_out_w trip_in_w station_id    ratio
## 1         19        11        135 1.727273
latlong <- data.frame(from_longitude=c(-87.656495, -87.660996, -87.6554864, -87.642746, -87.67328, -87.661535, -87.623727, -87.668745, -87.65103, -87.666507, -87.666611), 
                      from_latitude=c(41.858166, 41.869417, 41.8694821, 41.880422, 41.87501, 41.857556, 41.864059, 41.857901, 41.871737, 41.865234, 41.891072)
                      )

# Create a subgraph around the least traveled station, 275
g275 <- make_ego_graph(trip_g_simp, 1, nodes = "275", mode = "out")[[1]]

# Plot graph with geographic coordinates
plot(g275, layout = as.matrix(latlong), vertex.label.color = "blue", vertex.label.cex = .6,
     edge.color = 'black', vertex.size = 15, edge.arrow.size = .1,
     edge.width = E(g275)$weights*10000, main = "Lat/Lon Layout")

# Plot graph without geographic coordinates
plot(g275, vertex.label.color = "blue", vertex.label.cex = .6,
     edge.color = 'black', vertex.size = 15, edge.arrow.size = .1,
     edge.width = E(g275)$weights*10000,
     main = "Default Layout")

# Eigen centrality weighted
ec_weight <- eigen_centrality(trip_g_simp, directed = T, weights = NULL)

# Eigen centrality unweighted
ec_unweight <- eigen_centrality(trip_g_simp, directed = T, weights = NA)

# Closeness weighted
close_weight <- closeness(trip_g_simp, weights = NULL)

# Closeness unweighted
close_unweight <- closeness(trip_g_simp, weights = NA)

# Output nicely with cbind()
cbind(c(
  names(V(trip_g_simp))[which.min(ec_weight$vector)],
  names(V(trip_g_simp))[which.min(close_weight)],
  names(V(trip_g_simp))[which.min(ec_unweight$vector)],
  names(V(trip_g_simp))[which.min(close_unweight)]
  ), c("Weighted Eigen Centrality", "Weighted Closeness", "Unweighted Eigen Centrality", "Unweighted Closeness")
)
##      [,1]  [,2]                         
## [1,] "204" "Weighted Eigen Centrality"  
## [2,] "336" "Weighted Closeness"         
## [3,] "204" "Unweighted Eigen Centrality"
## [4,] "336" "Unweighted Closeness"
trip_g_ud <- as.undirected(trip_g_simp)
trip_g_ud
## IGRAPH 557213f UN-- 300 12972 -- 
## + attr: name (v/c)
## + edges from 557213f (vertex names):
##  [1] 5 --14 14--15 5 --16 13--16 15--16 16--17 5 --19 14--19 15--19 5 --20
## [11] 13--20 16--20 17--20 14--21 16--21 17--21 19--21 14--22 15--22 19--22
## [21] 21--22 5 --23 17--23 20--23 5 --24 16--24 17--24 23--24 5 --25 13--25
## [31] 16--25 17--25 20--25 21--25 22--25 23--25 24--25 5 --26 17--26 19--26
## [41] 20--26 24--26 25--26 13--27 16--27 20--27 24--27 26--27 17--28 20--28
## [51] 5 --29 16--29 17--29 20--29 24--29 25--29 26--29 16--30 17--30 19--30
## [61] 20--30 22--30 29--30 13--31 14--31 16--31 17--31 20--31 21--31 22--31
## [71] 23--31 24--31 25--31 26--31 28--31 30--31 15--32 19--32 21--32 22--32
## + ... omitted several edges
# Find the minimum number of cuts using min_cut()
ud_cut <- min_cut(trip_g_ud, value.only = FALSE)

# Print the vertex with the minimum number of cuts
print(ud_cut$partition1)
## + 1/300 vertex, named, from 557213f:
## [1] 281
# Make an ego graph
g<- make_ego_graph(trip_g_ud, 1, nodes = "281")[[1]]
plot(g, edge.color = 'black', edge.arrow.size = .1)

# Print the value
print(ud_cut$value)
## [1] 5
# Print cut object
print(ud_cut$cut)
## + 5/12972 edges from 557213f (vertex names):
## [1] 71 --281 135--281 167--281 203--281 281--305
far_stations <- c("231", "321")
close_stations <- c("231", "213")

# Compare the output of close and far vertices
stMincuts(trip_g_simp, far_stations[1], far_stations[2])$value
## [1] 54
stMincuts(trip_g_simp, close_stations[1], close_stations[2])$value
## [1] 49
# Find the actual value
clust_coef <- transitivity(trip_g_simp, type = "global")

# Get randomization parameters using gorder() and edge_density()
nv <- gorder(trip_g_simp)
ed <- edge_density(trip_g_simp)

# Create an empty vector to hold output of 300 simulations
graph_vec <- rep(NA, 300)

# Calculate clustering for random graphs
for(i in 1:300){
  graph_vec[i]<- transitivity(erdos.renyi.game(nv, ed, "gnp", directed = T), type = "global")
}

# Plot a histogram of the simulated values
hist(graph_vec, xlim = c(.35, .6), main = "Unweighted clustering randomization")

# Add a line with the true value
abline(v = clust_coef, col = "red")

# Find the mean local weighted clustering coefficient
m_clust <- mean(transitivity(trip_g_simp, type = "weighted"))
nv <- gorder(trip_g_simp)
ed <- edge_density(trip_g_simp)
graph_vec <- rep(NA, 100)

for(i in 1:100){
  g_temp <- erdos.renyi.game(nv, ed, "gnp", directed = T)
  # Sample existing weights and add them to the random graph
  E(g_temp)$weight <- sample(x = E(trip_g_simp)$weights, size = gsize(g_temp), replace = TRUE)
  graph_vec[i]<- mean(transitivity(g_temp, type = "weighted"))
}

# Plot a histogram of the simulated values
hist(graph_vec, xlim = c(.35, .7), main = "Weighted clustering randomization")

# Add a line with the true value
abline(v = m_clust ,col = "red")


Chapter 4 - Other Ways to Visualize Graph Data

Other packages for plotting graphs:

  • Base plotting in igraph is good for quick visualizations, but other libraries can produce polished plots with far less code
  • Can use the ggplot syntax with ggnet, for example
    • library(ggnetwork)
    • library(igraph)
    • library(GGally)
    • library(intergraph)
  • The basic igraph plotting is as follows
    • rand_g <- erdos.renyi.game(30, .15, "gnp", directed = F)
    • rand_g <- simplify(rand_g)
    • plot(rand_g)
  • The basic ggnet plotting requires use of asNetwork on the core graph
    • net <- asNetwork(rand_g)
    • ggnet2(net)
  • The basic ggnetwork plotting is more similar to ggplot2
    • gn <- ggnetwork(rand_g)
    • g <- ggplot(gn, aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges() + geom_nodes() + theme_blank()
    • plot(g)
  • Where base igraph might require voluminous code for extensions, ggnet or ggnetwork plots can be extended much more simply
    • ggnet2(net, node.size = "cent", node.color = "comm", edge.size = .8, color.legend = "Community Membership", color.palette = "Spectral", edge.color = c("color", "gray88"), size.cut = T, size.legend = "Centrality")
    • g <- ggplot(gn, aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges(aes(color = as.factor(comm))) + geom_nodes(aes(color = as.factor(comm), size = cent)) + theme_blank() + guides(color = guide_legend(title = "Community"), size = guide_legend(title = "Centrality"))
    • plot(g)

Interactive visualizations:

  • Can use R to create interactive javascript plots for outputs to html or similar
  • Examples can be built on a simple random graph
    • library(ggiraph)
    • library(htmlwidgets)
    • library(networkD3)
    • rand_g <- erdos.renyi.game(30, .12, "gnp", directed = F)
    • rand_g <- simplify(rand_g)
    • V(rand_g)$cent <- betweenness(rand_g)
  • Can plot using ggplot2 and ggiraph
    • g <- ggplot(ggnetwork(rand_g), aes(x = x, y = y, xend = xend, yend = yend)) + geom_edges(color = "black") + geom_nodes(aes(size = cent)) + theme_blank() + guides(size = guide_legend(title = "Centrality"))
    • my_gg <- g + geom_point_interactive(aes(tooltip = round(cent, 2)), size = 2) # Create ggiraph object
    • ggiraph(code = print(my_gg)) # Display ggiraph object
  • Can also customize the ggiraph call using valid CSS, for example
    • hover_css = "cursor:pointer;fill:red;stroke:red;r:5pt"
    • my_gg <- g + geom_point_interactive(aes(tooltip = round(cent, 2), data_id = round(cent, 2)), size = 2)
    • ggiraph(code = print(my_gg), hover_css = hover_css, tooltip_offx = 10, tooltip_offy = -10)
  • Can also plot with networkD3 which is easy to use but does not allow for much customization
    • nd3 <- igraph_to_networkD3(rand_g)
    • simpleNetwork(nd3$links)
  • Can add complexity to the D3 plot by specifying information about the nodes and edges to be plotted inside forceNetwork()
    • nd3$nodes$group = V(rand_g)$comm
    • nd3$nodes$cent = V(rand_g)$cent
    • forceNetwork(Links = nd3$links, Nodes = nd3$nodes, Source = 'source', Target = 'target', NodeID = 'name', Group = 'group', Nodesize = 'cent', legend = T, fontSize = 20)

Alternative visualizations:

  • Hairball plots are designed to maximize spacing between vertices, but as plots grow large the information becomes hard, if not impossible, to interpret
  • One potential solution is the hive plot
    • library(HiveR)
    • rand_g <- erdos.renyi.game(18, .3, "gnp", directed = T)
    • plot(rand_g, vertex.size = 7) # standard igraph plot
    • rand_g_df <- as.data.frame(get.edgelist(rand_g))
    • rand_g_df$weight <- 1
    • rand_hive <- edge2HPD(edge_df = rand_g_df)
    • rand_hive$nodes$axis <- sort(rep(1:3, 6))
    • rand_hive$nodes$radius <- as.double(rep(1:6, 3))
    • plotHive(rand_hive, method = "abs", bkgnd = "white")
  • Can modify hive plots by way of either nodes or edges
    • Setting location of each node

    • rand_hive$nodes$axis <- sort(rep(1:3, 6))
    • rand_hive$nodes$radius <- as.double(rep(1:6, 3))
    • Add weights to each edge

    • rand_hive$edges$weight <- as.double(rpois(length(rand_hive$edges$weight), 5))
    • Add color based on edge origination

    • rand_hive$edges$color[rand_hive$edges$id1 %in% 1:6] <- 'red'
    • rand_hive$edges$color[rand_hive$edges$id1 %in% 7:12] <- 'blue'
    • rand_hive$edges$color[rand_hive$edges$id1 %in% 13:18] <- 'green'
    • Plot

    • plotHive(rand_hive, method = "abs", bkgnd = "white")
  • Another alternative is the biofabric plot
    • Create random graph

    • rand_g <- erdos.renyi.game(10, .3, "gnp", directed = F)
    • rand_g <- simplify(rand_g)
    • Add names to vertices

    • V(rand_g)$name <- LETTERS[1:length(V(rand_g))]
    • Create biofabric plot

    • biofbc <- bioFabric(rand_g)
    • bioFabric_htmlwidget(biofbc)

Example code includes:

verts <- c(1185, 3246, 1684, 3634, 3870, 188, 2172, 3669, 2267, 1877, 3931, 1862, 2783, 2351, 423, 3692, 1010, 173, 1345, 3913, 3646, 2839, 2624, 4072, 2685, 2901, 2227, 2431, 1183, 602, 3937, 3688, 2823, 3250, 101, 1951, 3097, 884, 1299, 945, 583, 1691, 1687, 1504, 622, 566, 949, 1897, 1083, 3491, 187, 1799, 3249, 496, 2280, 840, 519, 3060, 4115, 1520, 2700, 385, 1558, 1113, 3303, 1818, 3283, 3291, 3218, 1781, 3055, 2547, 2874, 3, 1923, 890, 1536, 2477, 1422, 449, 984, 2697, 1686, 3181, 415, 1754, 3972, 3600, 3573, 706, 527, 2631, 1383, 2644, 1290, 756, 3147, 377, 4109, 2056, 2411, 1337, 1963, 3833, 1939, 4030, 4111, 2442, 1647, 590, 3749, 1208, 244, 3796, 2886, 570, 2199, 3818, 2342, 1618, 2591, 1279, 1230, 878, 1476, 3930, 616, 364, 567, 2753, 2470, 3554, 2683, 2938, 2077, 2629, 3273, 3131, 3900, 1749, 1240, 1629, 42, 731, 3350, 919, 950, 305, 976, 2906, 3363, 1974, 1539, 978, 441, 1546, 4110, 860, 1762, 864, 1989, 1401, 2572, 1482, 1406, 2110, 2926, 874, 1631, 1050, 2488, 726, 3408, 2946, 2636, 2437, 1468, 2089, 3447, 2292, 3308, 1231, 2788, 1043, 2339, 1893, 3935, 2220, 3589, 3544, 1077, 1263, 4114, 2434, 3679, 1831, 1596, 2585, 598, 2246, 936, 3770, 2355, 2017, 1576, 3445, 1425, 1128, 668, 674, 1884, 989, 845, 2634, 4068, 2736, 1374, 3922, 3202, 3583, 1102, 3746, 2838, 2674, 206, 3966, 1860, 2180, 2717, 3562, 2405, 1666, 2107, 228, 1014, 1543, 768, 3229, 594, 3117, 2121, 2568, 666, 2454, 1209, 2807, 1545, 3753, 3744, 2812, 995, 858, 2293, 1034, 2053, 3034, 650, 1562, 1821, 3351, 3572, 3402, 2600, 3663, 1991, 2222, 1296, 1338, 78, 1936, 3352, 25, 278, 632, 2962, 2826, 3734, 1792, 286, 2491, 2912, 4028, 1522, 863, 223, 1518, 249, 866, 210, 2567, 1140, 386, 276, 3368, 2885, 3122, 3754, 396, 379, 3051, 2996, 36, 2973, 4106, 2404, 1834, 3920, 32, 1724, 1876, 1484, 1769, 2715, 211, 1350, 3054, 3178, 904, 1346, 3256, 3243, 1124, 559, 2672, 394, 128, 3790, 133, 1283, 3468, 3934, 1085, 2794, 3157, 1190, 1864, 2638, 2426, 2435, 3696, 1567, 451, 1987, 850, 1836, 1397, 
3710, 1465, 865, 2350, 515, 3645, 1940, 614, 2341, 3711, 2516, 3914, 1216, 3140, 541, 725, 3369, 1157, 1364, 2943, 3947, 67, 1525, 1812, 1582, 1285, 4117, 1705, 1999, 3608, 2899, 782, 1155, 3632, 2187, 2844, 1393, 2873, 2008, 3412, 692, 1053, 355, 785, 3643, 1105, 2706, 2927, 393, 893, 1007, 4021, 439, 3687, 3667, 510, 3365, 2141, 1469, 1671, 2623, 307, 1259, 2526, 1176, 3083, 798, 1845, 1023, 712, 3520, 1191, 1771, 104, 2025, 2382, 2204, 3784, 3292, 2313, 1119, 1433, 593, 3182, 3516, 2079, 1215, 3673, 3831, 2257, 399, 1793, 366, 3690, 1041, 2147, 2690, 609, 3184, 2603, 2793, 540, 1315, 2471, 1922, 3792, 882, 214, 867, 3261, 3816, 2737, 3990, 457, 3566, 1595, 1697, 605, 2138, 990, 841, 2524, 1033, 2958, 343, 2998, 1559, 2756, 2414, 1620, 2285, 2, 791, 2566, 783, 2961, 1120, 2500, 3390, 421, 464, 2463, 4056, 3029, 3525, 256, 1668, 2544, 316, 3598, 917, 180, 2485, 2848, 1280, 1326, 1039, 290, 1321, 644)
verts <- c(verts, 1937, 1820, 3733, 1232, 1677, 298, 3102, 1427, 2653, 619, 1639, 2774, 226, 2934, 1084, 1312, 1123, 135, 1865, 2440, 3245, 92, 3551, 1088, 3370, 2467, 1604, 2928, 142, 2648, 1250, 2970, 1918, 983, 2866, 328, 2976, 3653, 2692, 4099, 291, 3819, 2864, 1375, 1169, 732, 2031, 3166, 1888, 2092, 2372, 1887, 1816, 58, 170, 3306, 3903, 715, 2312, 2323, 1404, 3824, 1942, 3142, 1964, 3214, 2084, 1502, 3366, 2513, 1464, 66, 2007, 1735, 3109, 2876, 3021, 1301, 3089, 535, 996, 3916, 3451, 2057, 1858, 215, 3417, 424, 312, 3103, 1791, 1189, 3149, 113, 835, 2415, 794, 3636, 612, 2816, 514, 2889, 1162, 1313, 2210, 339, 3850, 3481, 2047, 2739, 3124, 2643, 3428, 155, 3161, 3027, 2711, 1317, 148, 1273, 956, 2969, 1265, 1063, 3899, 3945, 1597, 2543, 363, 767, 3322, 2618, 2850, 1454, 2066, 2778, 3534, 1339, 314, 2174, 2589, 297, 3932, 2132, 2612, 3180, 1649, 1966, 2552, 3581, 3148, 196, 1741, 1213, 2924, 3936, 406, 3631, 813, 259, 3230, 543, 2233, 599, 70, 1797, 3607, 975, 1448, 2022, 2777, 696, 1581, 1542, 2523, 2457, 2857, 3046, 3272, 1891, 3681, 586, 1644, 871, 137, 2176, 1849, 480, 972, 1996, 565, 330, 1466, 1217, 2888, 889, 80, 3487, 1143, 2157, 3594, 3747, 634, 1463, 2150, 1775, 2247, 2484, 1658, 1309, 24, 13, 3383, 367, 1423, 2439, 2522, 3637, 2064, 3639, 4046, 2078, 3676, 3506, 1413, 2964, 2192, 3130, 4078, 1069, 2720, 3344, 1090, 5, 3848, 501, 167, 3915, 3787, 4049, 3986, 233, 2343, 3196, 3918, 4063, 537, 242, 3809, 1648, 1662, 2986, 124, 685, 1726, 4087, 1932, 3999, 1910, 484, 489, 1382, 2289, 2189, 3067, 2722, 2262, 2702, 429, 839, 1109, 1361, 2123, 4058, 3959, 2735, 52, 2183, 2707, 1538, 678, 63, 943, 3047, 3108, 1806, 730, 1628, 2664, 1355, 345, 932, 1201, 861, 3861, 1214, 403, 156, 3429, 3210, 3355, 1583, 2479, 3508, 164, 2299, 3320, 2923, 2562, 460, 4013, 417, 1947, 1853, 2272, 1027, 1997, 3266, 2449, 250, 1486, 177, 1118, 3644, 14, 2538, 3836, 2368, 3349, 1879, 2310, 3413, 4032, 319, 3155, 2413, 3842, 3724, 1802, 3319, 2940, 31, 773, 426, 1067, 2374, 
3240, 2335, 4010, 3398, 3096, 392, 245, 2898, 4026, 138, 2109, 1526, 2011, 881, 512, 372, 1650, 3373, 3659, 552, 2474, 1712, 3786, 2185, 43, 3406, 2890, 3504, 348, 2982, 2186, 481, 4018, 3048, 1360, 962, 838, 720, 1826, 4011, 2161, 1763, 2617, 2447, 65, 1227, 3938, 2569, 3662, 1746, 2742, 4020, 2148, 1643, 2450, 4093, 3905, 230, 3401, 168, 2779, 1847, 1006, 3074, 1894, 1702, 1229, 3704, 2586, 3595, 1163, 3661, 2230, 3236, 1111, 1770, 438, 2504, 2828, 651, 2456, 1900, 3050, 506, 1674, 3477, 2766, 76, 3606, 3630, 1237, 3617, 295, 3512, 1286, 3623, 3495, 964, 3407, 494, 3629, 140, 1178, 3045, 2041, 194, 3852, 3800, 1605, 1420, 1968, 442, 3570, 1796, 1729, 369, 2401, 1507, 2462, 145, 2580, 848, 4043, 3443, 2979, 22, 3727, 1316, 1437, 3450, 3590, 3465, 3188, 2373, 432, 3425, 3449, 1356, 273, 700, 1789, 1251, 1767, 3998, 2005, 1222, 2214, 340)

# Create subgraph of rt_g
rt_samp <- induced_subgraph(rt_g, verts)

# Convert from igraph using asNetwork()
net <- intergraph::asNetwork(rt_samp)

# Plot using igraph
plot(rt_samp, vertex.label = NA, edge.arrow.size = 0.2, edge.size = 0.5, 
     vertex.color = "black", vertex.size = 1
     )

# Plot using ggnet2
GGally::ggnet2(net, node.size = 1, node.color = "black", edge.size = .4)

# Raw plot of rt_samp using ggnetwork()
library(ggnetwork)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggplot(ggnetwork(rt_samp, arrow.gap = .01) , aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_edges(arrow = arrow(length = unit(6, "pt"), type = "closed"), color = "black") +
    geom_nodes(size = 4) 
## Loading required package: sna
## Loading required package: statnet.common
## 
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
## 
##     order
## Loading required package: network
## network: Classes for Relational Data
## Version 1.13.0.1 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.
## 
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges,
##     delete.vertices, get.edge.attribute, get.edges,
##     get.vertex.attribute, is.bipartite, is.directed,
##     list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute
## sna: Tools for Social Network Analysis
## Version 2.4 created on 2016-07-23.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.
## 
## Attaching package: 'sna'
## The following objects are masked from 'package:igraph':
## 
##     betweenness, bonpow, closeness, components, degree,
##     dyad.census, evcent, hierarchy, is.connected, neighborhood,
##     triad.census

# Prettier plot of rt_samp using ggnetwork()
ggplot(ggnetwork(rt_samp, arrow.gap = .01),aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_edges(arrow = arrow(length = unit(6, "pt"), type = "closed"), color = "black", curvature = .2) +
    geom_nodes(size = 4) + theme_blank()

# NEED TO FIX!
rt_keys <- sort(table(vertex_attr(rt_g)$clust), decreasing=TRUE)
# rt_drops <- names(rt_keys)[11:length(rt_keys)]
# vt_drops <- which(vertex_attr(rt_g)$clust %in% rt_drops)
# rt_use <- delete_vertices(rt_g, vt_drops)
rt_use <- induced_subgraph(rt_g, which(V(rt_g)$clust %in% names(rt_keys[1:10])))

# Convert to a network object
net <- intergraph::asNetwork(rt_use)
ggnet2(net, node.size = "cent", node.color = "clust", edge.size = .1, 
       color.legend = "Community Membership", color.palette = "Spectral"
       )

# Now remove the centrality legend by setting size to FALSE inside guides()
ggnet2(net, node.size = "cent", node.color = "clust", edge.size = .1, 
       color.legend = "Community Membership", color.palette = "Spectral"
       ) + 
    guides( size = FALSE)

# Add edge colors
ggnet2(net, node.size = "cent", node.color = "clust", edge.size = .1, 
       color.legend = "Community Membership", color.palette = "Spectral", 
       edge.color = c("color", "gray88")) +
  guides( size = FALSE)

# NEED TO CREATE rt_g_smaller!
# Basic plot where we set parameters for the plots using geom_edges() and geom_nodes()
# ggplot(ggnetwork(rt_g_smaller, arrow.gap = .01), aes(x = x, y = y, xend = xend, yend = yend)) + 
#   geom_edges(arrow = arrow(length = unit(6, "pt"), type = "closed"), curvature = .2, color = "black") + 
#   geom_nodes(size = 4, aes(color = comm)) + 
#   theme_blank()

# Added guide legend, changed line colors, added size 
# ggplot(ggnetwork(rt_g_smaller, arrow.gap = .01), aes(x = x, y = y, xend = xend, yend = yend)) + 
#   geom_edges(arrow = arrow(length = unit(6, "pt"), type = "closed"), curvature = .2, lwd = .3, aes(color=comm)) +
#   geom_nodes(aes(color = comm, size = cent)) + 
#   theme_blank() +  
#   guides(color = guide_legend(title = "Community"), size = guide_legend(title = "Centrality"))


# NEED TO FIX!
# Add betweenness centrality using betweenness()
V(trip_g_simp)$cent <- igraph::betweenness(trip_g_simp)

# Create a ggplot object with ggnetwork to render using ggiraph
g <- ggplot(ggnetwork(trip_g_simp, arrow.gap = .01), aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_edges(color = "black") + 
    geom_nodes(aes(size = cent)) + 
    theme_blank() 
plot(g)

# Create ggiraph object and assign the tooltip to be interactive
my_gg <- g + ggiraph::geom_point_interactive(aes(tooltip = round(cent, 2), 
                                                 data_id = round(cent, 2)
                                                 ), size = 2
                                             ) 

# Define some hover css so the cursor turns red
hover_css = "cursor:pointer;fill:red;stroke:red;r:3pt"
# ggiraph::ggiraph(code = print(my_gg), hover_css = hover_css, tooltip_offx = 10, tooltip_offy = -10)


# Add community membership as a vertex attribute using the cluster_walktrap algorithm
V(rt_g)$comm <- membership(cluster_walktrap(rt_g))

# Create an induced_subgraph
rt_sub_g <- induced_subgraph(rt_g, which(V(rt_g)$comm %in% 10:13))

# Plot to see what it looks like without an interactive plot using ggnetwork
ggplot(ggnetwork(rt_sub_g, arrow.gap = .01), aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_edges(color = "black") + 
    geom_nodes(aes(color = as.factor(comm))) + 
    theme_blank()  

# Convert to a networkD3 object
# nd3 <- igraph_to_networkD3(rt_sub_g)

# Assign grouping factor as community membership
# nd3$nodes$group = V(rt_sub_g)$comm

# Render your D3.js graph
# forceNetwork(Links = nd3$links, Nodes = nd3$nodes, Source = 'source', 
#              Target = 'target', NodeID = 'name', Group = 'group', legend = T, fontSize = 20
#              )

# Convert  trip_df to hive object using edge2HPD()
# bike_hive <- edge2HPD(edge_df =  as.data.frame(trip_df))

# Assign to trip_df edgecolor using our custom function
# trip_df$edgecolor <- dist_gradient(trip_df$geodist)

# Calculate centrality with betweenness()
# bike_cent <- betweenness(trip_g)

# Set each node's radius based on centrality (small random radius where centrality is zero)
# bike_hive$nodes$radius <- ifelse(bike_cent > 0, bike_cent, runif(1000, 0, 3))

# Set axis as integers and axis colors to black
# bike_hive$nodes$axis <- as.integer(dist_stations$axis)
# bike_hive$axis.cols <- rep("black", 3)

# Set the edge colors to a heatmap based on trip_df$edgecolor
# bike_hive$edges$color <- trip_df$edgecolor
# plotHive(bike_hive, method = "norm", bkgnd = "white")


# Add community membership as a vertex attribute
V(rt_g)$comm <- membership(cluster_walktrap(rt_g))

# Create a subgraph
rt_sub_g <- induced_subgraph(rt_g, which(V(rt_g)$comm %in% 10:15))

# Plot to see what it looks like without an interactive plot
ggplot(ggnetwork(rt_sub_g, arrow.gap = .01), aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_edges(color = "black") + 
    geom_nodes(aes(color = as.factor(comm)))+ theme_blank() +
    theme(legend.position = "none")

# Make a Biofabric plot htmlwidget
# rt_bf <- bioFabric(rt_sub_g)
# bioFabric_htmlwidget(rt_bf)


# Create a dataframe of start and end latitude and longitude and add weights
# ll_to_plot <- bike_dat %>% group_by(from_station_id, to_station_id, from_latitude, 
#                                     from_longitude, to_latitude, to_longitude, usertype
#                                     ) %>% 
#     summarise(weight = n())

# Create a base map with station points with ggmap()
# ggmap(chicago) + 
#     geom_segment(data = ll_to_plot, aes(x = from_longitude, y = from_latitude, 
#                                         xend = to_longitude, yend = to_latitude, 
#                                         colour = usertype, size = weight
#                                         ), alpha = .5
#                )

Fundamentals of Bayesian Analysis in R:

Chapter 1 - What is Bayesian Analysis?

Introduction:

  • A British team spearheaded by Turing found ways to decrypt German communications in 1941
    • Key to Turing’s success was the use of Bayesian methods, which were not widely used at the time
  • Bayesian inference is “A method for figuring out unobservable quantities given known facts that uses probability to describe the uncertainty over what the values of the unknown quantities could be”
    • The unknown was the configuration of the wheels in the encryption machine - British already knew what a given wheel configuration would produce
    • Turing worked backwards to figure out the probable configuration of the wheels from the messages that he had received
  • Bayesian analysis is flexible and can be problem-specific and customized to a specific dataset and analysis need

Bayesian data analysis - named for Thomas Bayes from the early-mid 1700s:

  • Bayesian data analysis is about probabilistic inference for learning from data and drawing conclusions
  • For this specific course, probability will be a statement about the certainty (p=1 means certain yes, p=0 means certain no), though there are other definitions that can be used
    • Probability need not be only about yes/no statements, and can be associated to distributions such as “amount of rainfall tomorrow”
    • “The role of probability distributions in Bayesian data analysis is to represent uncertainty, and the role of Bayesian inference is to update probability distributions to reflect what has been learned from data.”
  • Example of using a Bayesian approach to patients
    • prop_model(data) is a function that has been created to plot out probabilities vs. p
    • The data is a vector of successes and failures represented by 1s and 0s
    • There is an unknown underlying proportion of success
    • Whether a data point is a success is affected only by this underlying proportion
    • Prior to seeing any data, any underlying proportion of success is equally likely
    • The result is a probability distribution that represents what the model knows about the underlying proportion of success

Samples and posterior samples:

  • Prior probability distribution is a distribution PRIOR to updating with some data; for example, equally probable that p falls anywhere between 0 and 1
  • Posterior probability distribution is a distribution AFTER updating with what is learned by seeing some data
  • Can be valuable to have a vector of potential outcomes, appropriately weighted by the likelihood of each of the outcomes
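
For instance (a minimal sketch assuming the same Beta(1, 1) uniform prior that prop_model() defaults to), rbeta() produces exactly such a weighted vector of potential outcomes after observing 2 successes and 2 failures:

```r
set.seed(42)
# A uniform Beta(1, 1) prior updated with 2 successes and 2 failures
# yields a Beta(1 + 2, 1 + 2) posterior
posterior <- rbeta(10000, shape1 = 1 + 2, shape2 = 1 + 2)

# Each draw is one plausible underlying proportion of success, appearing
# in the vector in proportion to its posterior probability
median(posterior)
quantile(posterior, c(0.05, 0.95))
```

Summaries such as the median and a 90% interval then fall straight out of the sample vector, which is the main practical payoff of working with posterior samples.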

Chapter wrap-up:

  • Can draw conclusions about the probabilities based on even a small sample of data observed
  • Next chapters will cover mechanics of Bayesian inference in more detail

Example code includes:

prop_model <- function(data = c(), prior_prop = c(1, 1), n_draws = 10000) {
    data <- as.logical(data)
    proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
    data_indices <- round(seq(0, length(data), length.out = min(length(data) + 1, 20)))

    post_curves <- map_dfr(data_indices, function(i) {
        value <- ifelse(i == 0, "Prior", ifelse(data[i], "Success", "Failure"))
        label <- paste0("n=", i)
        probability <- dbeta(proportion_success, prior_prop[1] + sum(data[seq_len(i)]), 
                             prior_prop[2] + sum(!data[seq_len(i)])
                             )
        probability <- probability / max(probability)
        data_frame(value, label, proportion_success, probability)
        }
    )
    post_curves$label <- fct_rev(factor(post_curves$label, levels =  paste0("n=", data_indices )))
    post_curves$value <- factor(post_curves$value, levels = c("Prior", "Success", "Failure"))
  
    p <- ggplot(post_curves, aes(x = proportion_success, y = label, height = probability, fill = value)) +
        ggridges::geom_density_ridges(stat="identity", color = "white", 
                                      alpha = 0.8, panel_scaling = TRUE, size = 1
                                      ) +
        scale_y_discrete("", expand = c(0.01, 0)) +
        scale_x_continuous("Underlying proportion of success") +
        scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65), name = "", 
                          drop = FALSE, labels =  c("Prior   ", "Success   ", "Failure   ")
                          ) +
        #ggtitle(paste0("Binomial model - Data: ", sum(data),  " successes, " , sum(!data), " failures"))  +
        theme_light(base_size = 18) +
        theme(legend.position = "top")
    print(p)
  
    invisible(rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + sum(!data)))
}


# Define data and run prop_model
data = c(1, 0, 0, 1)
prop_model(data)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
posterior <- prop_model(data)

head(posterior)
## [1] 0.1036182 0.1555942 0.1115889 0.1805051 0.2815857 0.3078648
hist(posterior, breaks = 30, xlim = c(0, 1), col = "palegreen4")

# Get some more information about posterior
median(posterior)
## [1] 0.1882898
quantile(posterior, c(0.05, 0.95))
##         5%        95% 
## 0.06031634 0.38918378
sum(posterior > 0.07) / length(posterior)
## [1] 0.9263

Chapter 2 - How Does Bayesian Inference Work?

Parts needed for Bayesian inference:

  • Bayesian inference requires priors (what is known before seeing the data), a generative model, and data
    • The generative model is a formula or computer expression that can generate simulated data based on provided input parameters
    • For example, could assume that there is a proportion of zombies cured and a number of zombies treated and then simulate data based on these

Using a generative model:

  • The binomial distribution function can be very helpful as a generative model for the number of successes across repeated 1/0 trials
  • Typically in data analysis, we know the outcome and want to figure out the likely parameters in our generative model (that is the Bayesian inference)

Representing uncertainty with priors:

  • The prior reflects our certainty/uncertainty in the parameters prior to running the analysis
    • Example could be a uniform distribution from (a, b)
    • proportion_clicks <- runif(n = 6, min = 0.0, max = 1.0) # sample 6 values that are between 0 and 1 with every number being equally likely
    • n_clicks <- rbinom(n = 6, size = 100, proportion_clicks) # rbinom will vectorize over proportion_clicks

Bayesian models and conditioning:

  • The Bayesian model is based on the generative model and the prior
    • prior <- data.frame(proportion_clicks, n_visitors) # joint PDF over proportion_clicks and n_visitors
  • Can then condition on the observed data and assess the joint PDF for the implications on the distribution of the underlying proportion
    • “Bayesian inference is conditioning on data, in order to learn about parameter values.”

Chapter wrap-up:

  • Used the binomial model as an assumed generative function and a uniform distribution as the prior probabilities
  • Calculated joint probability distributions and then found a probability distribution based on a known outcome (data)
  • The posterior can then be used as the prior for future analyses, and can be repeated indefinitely
  • Bayesian machinery from simple cases can be extended to more complex cases
    • Just need a generative model and an assumption for the prior
    • Need a computational model that can scale easily

Example code includes:

# Generative zombie drug model
# Parameters
prop_success <- 0.42
n_zombies <- 100
# Simulating data
data <- c()
for(zombie in 1:n_zombies) {
  data[zombie] <- runif(1, min = 0, max = 1) < prop_success
}
data <- as.numeric(data)
data
##   [1] 1 1 1 1 0 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1
##  [36] 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 0
##  [71] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
data_counts <- sum(as.numeric(data))
data_counts
## [1] 46
# Try out rbinom
rbinom(n = 1, size = 100, prob = 0.42)
## [1] 31
# Try out rbinom
rbinom(n = 200, size = 100, prob = 0.42)
##   [1] 44 51 40 48 44 37 41 41 50 49 44 42 43 41 41 33 39 40 41 39 46 43 45
##  [24] 41 42 38 41 45 45 39 41 48 45 45 43 41 36 39 45 44 37 37 39 51 43 31
##  [47] 42 39 42 42 38 46 36 41 48 42 41 37 41 38 36 43 53 37 49 35 43 51 34
##  [70] 44 40 44 39 47 47 40 43 38 41 38 40 36 35 37 37 40 41 47 39 39 45 45
##  [93] 47 44 36 49 33 40 44 39 40 36 42 38 47 39 48 48 50 42 51 39 44 42 45
## [116] 38 44 40 40 41 40 46 43 40 46 49 40 48 43 43 39 39 41 44 45 43 43 41
## [139] 48 38 47 42 45 38 44 44 38 40 36 38 46 41 42 29 40 42 40 45 33 46 54
## [162] 47 36 49 41 41 39 40 44 43 52 41 38 38 45 49 49 39 47 40 42 37 43 46
## [185] 42 41 47 48 38 50 42 40 38 36 49 39 46 37 42 36
# Fill in the parameters
n_samples <- 100000
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks)

# Visualize the results
hist(n_visitors)

# Update proportion_clicks
n_samples <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = proportion_clicks)

# Visualize the results
hist(n_visitors)

hist(proportion_clicks)

# Create prior
prior <- data.frame(proportion_clicks, n_visitors)
head(prior)
##   proportion_clicks n_visitors
## 1       0.102175374          6
## 2       0.066737942          6
## 3       0.007708941          0
## 4       0.123223223         13
## 5       0.070207388          6
## 6       0.112998033          7
# Create posterior
posterior <- prior[prior$n_visitors==13, ]
hist(posterior$proportion_clicks)

prior <- posterior
head(prior)
##    proportion_clicks n_visitors
## 4          0.1232232         13
## 21         0.1409067         13
## 44         0.0885686         13
## 49         0.1316969         13
## 61         0.1619553         13
## 76         0.1349985         13
prior$n_visitors <- rbinom(nrow(prior), size=100, prob=prior$proportion_clicks)
hist(prior$n_visitors)

mean(prior$n_visitors >= 5)
## [1] 0.9867033

Chapter 3 - Why Use Bayesian Data Analysis?

Four good things with Bayes:

  • Many good tools exist for Bayesian analysis, and those will be covered in Chapter 5
  • The main reasons for Bayesian analysis are the flexibility and power
    • You can include information sources such as expertise in addition to the data
    • You can make any comparisons between groups or data sets
    • You can use the results of Bayesian analysis for Decision Analysis
    • You can change the underlying statistical model
  • Background information, common knowledge, or expertise can be incorporated into the prior
    • Can also exclude all such information by assuming a uniform distribution for the prior
    • The beta distribution can be useful for a proportion, and is set based on alpha and beta
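
To build intuition for the beta parameters, a quick sketch (the shape values here are illustrative, not from the course):

```r
# A Beta(alpha, beta) distribution has mean alpha / (alpha + beta);
# a larger total alpha + beta encodes more certainty (a narrower prior)
set.seed(42)
weak_prior   <- rbeta(100000, shape1 = 2,  shape2 = 8)   # mean 0.2, wide
strong_prior <- rbeta(100000, shape1 = 20, shape2 = 80)  # mean 0.2, narrow
c(mean(weak_prior), mean(strong_prior))   # both near 0.2
c(sd(weak_prior), sd(strong_prior))       # the second is roughly 3x tighter
```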

Contrasts and comparisons:

  • The more data there is, the more the posterior is informed by the data rather than by the prior
  • There is often a benefit to comparing groups - for example, two different treatments or two different campaigns
    • Can compare the posterior distributions from the two groups, which is fairly simple since each distribution is contained in a vector
    • posterior$prop_diff <- posterior$video_prop - posterior$text_prop

Decision analysis:

  • Can calculate median, mean, credible interval, likelihood of more extreme than a certain parameter, etc.
  • The results of a Bayesian analysis can be used for Decision analysis
  • Can add dimensions such as cost and revenue, then compare profitability
    • video_cost <- 0.25
    • text_cost <- 0.05
    • visitor_spend <- 2.53
    • posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
    • posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
    • posterior$profit_diff <- posterior$video_profit - posterior$text_profit

Change anything and everything:

  • There can be large uncertainty as to the outcomes, particularly if the data sizes are small and the key metrics are close to the critical parameter
  • Can switch out the generating functions and re-run the approach - example of a banner that is pay-per-day rather than pay-per-impression
    • One option is to split each day into 1440 minutes and assume a probability of success (1/0) for each minute, which has drawbacks
    • In the limiting case of smaller and smaller slices, the Poisson distribution arises - it has only a single parameter, the expected number of successes
    • n_clicks <- rpois(n = 100000, lambda = 20)
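
The minute-slicing limit can be checked by simulation; a sketch assuming 20 expected clicks per day:

```r
set.seed(42)
n_draws <- 100000
# Binomial: 1440 one-minute slices per day, each with a small success probability
binom_clicks <- rbinom(n_draws, size = 1440, prob = 20 / 1440)
# Poisson: the limiting distribution with the same expected count
pois_clicks <- rpois(n_draws, lambda = 20)
c(mean(binom_clicks), mean(pois_clicks))  # both close to 20
c(var(binom_clicks), var(pois_clicks))    # nearly identical; Poisson variance equals its mean
```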

Bayes is optimal, kind of . . .

  • Bayesian analysis can be useful in many ways as shown in this chapter
  • Bayesian analysis is (kind of) optimal, provided that the generative model is perfectly precise and accurate
    • In the world defined by the model the Bayesian approach is optimal
    • However, no statistical model can ever be optimal in the real world

Example code includes:

# Draw from the beta distribution
beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1)

# Explore the results
hist(beta_sample)

# Draw from the beta distribution
beta_sample <- rbeta(n = 10000, shape1 = 100, shape2 = 100)

# Explore the results
hist(beta_sample)

# Draw from the beta distribution
beta_sample <- rbeta(n = 10000, shape1 = 100, shape2 = 20)

# Explore the results
hist(beta_sample)

n_draws <- 100000
n_ads_shown <- 100

# Update proportion_clicks
proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95)
n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 13, ]

# Plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks, 
     xlim = c(0, 0.25))
hist(posterior$proportion_clicks, 
     xlim = c(0, 0.25))

# Reset mfcol below

# Define parameters
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_draws, size = n_ads_shown, prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)

# Create posteriors
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]

# Visualize posteriors
hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))
hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))

posterior <- data.frame(video_prop = posterior_video$proportion_clicks[1:4000], 
                        text_prop  = posterior_text$proportion_clicks[1:4000]
                        )
    
# Create prop_diff
posterior$prop_diff <- posterior$video_prop - posterior$text_prop

# Plot your new column
hist(posterior$prop_diff)

# Explore prop_diff
median(posterior$prop_diff)
## [1] 0.06583102
mean(posterior$prop_diff > 0)
## [1] 0.947
visitor_spend <- 2.53
video_cost <- 0.25
text_cost <- 0.05

posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
head(posterior)
##   video_prop  text_prop     prop_diff video_profit text_profit
## 1 0.11438338 0.06966232  0.0447210643   0.03938996  0.12624567
## 2 0.08828099 0.08925227 -0.0009712789  -0.02664909  0.17580825
## 3 0.13337523 0.03383959  0.0995356356   0.08743932  0.03561416
## 4 0.11821430 0.08126194  0.0369523543   0.04908217  0.15559272
## 5 0.10437450 0.07691290  0.0274615984   0.01406748  0.14458963
## 6 0.13520477 0.07912402  0.0560807510   0.09206808  0.15018378
hist(posterior$video_profit)

hist(posterior$text_profit)


posterior$profit_diff <- posterior$video_profit - posterior$text_profit
head(posterior)
##   video_prop  text_prop     prop_diff video_profit text_profit profit_diff
## 1 0.11438338 0.06966232  0.0447210643   0.03938996  0.12624567 -0.08685571
## 2 0.08828099 0.08925227 -0.0009712789  -0.02664909  0.17580825 -0.20245734
## 3 0.13337523 0.03383959  0.0995356356   0.08743932  0.03561416  0.05182516
## 4 0.11821430 0.08126194  0.0369523543   0.04908217  0.15559272 -0.10651054
## 5 0.10437450 0.07691290  0.0274615984   0.01406748  0.14458963 -0.13052216
## 6 0.13520477 0.07912402  0.0560807510   0.09206808  0.15018378 -0.05811570
hist(posterior$profit_diff)

median(posterior$profit_diff)
## [1] -0.03344751
mean(posterior$profit_diff < 0)
## [1] 0.6345
x <- rpois(n = 10000, lambda = 3)
hist(x)

x <- rpois(n = 10000, lambda = 11.5)
hist(x)

x <- rpois(n = 10000, lambda = 11.5)
mean(x >= 15)
## [1] 0.182
n_draws <- 100000
n_ads_shown <- 100
mean_clicks <- runif(n_draws, min = 0, max = 80)
n_visitors <- rpois(n_draws, lambda=mean_clicks)
                     
prior <- data.frame(mean_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 19, ]

hist(prior$mean_clicks)
hist(posterior$mean_clicks)

# Reset to default
par(mfcol = c(1, 1))

Chapter 4 - Bayesian Inference with Bayes’ Theorem

Probability rules:

  • The computation method used so far does not scale well, but there are alternatives
  • Bayesian statistics is a hot research area, and there are many methods that reach the same results faster
  • Probability is defined as a statement of certainty/uncertainty between 0 and 1 (may be a distribution or an allocation of probabilities over all possible values)
  • Conditional probabilities are often of interest - P(A | B) is the probability of A given that B has occurred
    • Can also get a conditional probability distribution
  • Sometimes probabilities can be summed - when they are exclusive and the goal is to get 1 of them
  • Sometimes probabilities can be multiplied - when they are independent and the goal is to get all of them
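
A toy illustration of the two rules with a fair six-sided die (the numbers are plain arithmetic, not course data):

```r
# Exclusive outcomes, want any one of them: add the probabilities
p_one_or_two <- 1/6 + 1/6    # P(roll a 1 OR a 2) = 1/3
# Independent events, want all of them: multiply the probabilities
p_two_sixes <- (1/6) * (1/6) # P(six on the 1st roll AND six on the 2nd) = 1/36
c(p_one_or_two, p_two_sixes)
```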

Calculating likelihoods:

  • Can simulate or calculate probabilities - simulate and count with a common generative model and small dataset, or calculate using an optimized formula
  • For common distributions, can use the density functions such as dbinom() to get the key probabilities in many cases
    • dbinom(13, size = 100, prob = 0.1) + dbinom(14, size = 100, prob = 0.1) # probability of getting 13 or 14 successes in 100 trials with success 0.1 per trial
    • n_visitors = seq(0, 100, by = 1)
    • probability <- dbinom(n_visitors, size = 100, prob = 0.1)
  • For continuous distributions such as the uniform, there is no specific probability of any given value
    • Probability densities are returned instead, and can be viewed as the relative probabilities
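
For instance, dunif() returns densities rather than probabilities, so values above 1 are possible (a minimal sketch):

```r
# Uniform(0, 0.2): the density is 1 / (0.2 - 0) = 5 inside the interval
dunif(0.10, min = 0, max = 0.2)  # 5 - a density, not a probability
dunif(0.50, min = 0, max = 0.2)  # 0 - outside the interval
# Densities still integrate to 1 over the interval (5 * 0.2 = 1), and can be
# compared as relative probabilities of different parameter values
```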

Bayesian calculation:

  • Conditioning on the observed data is at the core of Bayesian inference
  • Example of converting previous simulations to precise calculations
    • n_ads_shown <- 100
    • n_visitors <- seq(0, 100, by = 1) # full potential range of visitors based on ads
    • proportion_clicks <- seq(0, 1, by = 0.01) # fine grid of values even if it does not fully sample the given space
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors) # all possible combinations of n_visitors and proportion_clicks
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2) # prior is of uniform density
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks) # likelihood given the rows in pars
    • pars$probability <- pars$likelihood * pars$prior # unscaled posterior probability
    • pars$probability <- pars$probability / sum(pars$probability) # normalized posterior probability
    • pars <- pars[pars$n_visitors == 13, ] # filter on the 13 observed clicks
    • pars$probability <- pars$probability / sum(pars$probability) # normalize remaining probs to 1

Bayes theorem:

  • An example of Bayes’ theorem is provided above by the multiplication of the prior and the probability
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability)
    • This is an example of p(Param | Data) = P(Data | Param) * P(Param Before Seeing Data) / sum-of-all-numerators
  • The grid approximation technique was used above - cannot ever get all parameters for a continuous distribution
  • Can use mathematical notations where = is a point parameter and ~ is follows a specific distribution

Example code includes:

prob_to_draw_ace <- 4 / 52
prob_to_draw_four_aces <- (4 / 52) * (3 / 51) * (2 / 50) * (1 / 49)


n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n = 99999, 
    size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors)
prob_13_visitors
## [1] 0.07536075
prob_13_visitors <- dbinom(x=13, size=n_ads_shown, prob=proportion_clicks)
prob_13_visitors
## [1] 0.07430209
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- 0:n_ads_shown
prob <- dbinom(n_visitors, 
    size = n_ads_shown, prob = proportion_clicks)
prob
##   [1]  2.656140e-05  2.951267e-04  1.623197e-03  5.891602e-03  1.587460e-02
##   [6]  3.386580e-02  5.957873e-02  8.889525e-02  1.148230e-01  1.304163e-01
##  [11]  1.318653e-01  1.198776e-01  9.878801e-02  7.430209e-02  5.130383e-02
##  [16]  3.268244e-02  1.929172e-02  1.059153e-02  5.426525e-03  2.602193e-03
##  [21]  1.170987e-03  4.956559e-04  1.977617e-04  7.451890e-05  2.656461e-05
##  [26]  8.972934e-06  2.875940e-06  8.758007e-07  2.537042e-07  6.998736e-08
##  [31]  1.840408e-08  4.617512e-09  1.106279e-09  2.532895e-10  5.545880e-11
##  [36]  1.161994e-11  2.331161e-12  4.480309e-13  8.253201e-14  1.457830e-14
##  [41]  2.470212e-15  4.016606e-16  6.269305e-17  9.395858e-18  1.352434e-18
##  [46]  1.870032e-19  2.484342e-20  3.171501e-21  3.890962e-22  4.587982e-23
##  [51]  5.199713e-24  5.664175e-25  5.930440e-26  5.967739e-27  5.771270e-28
##  [56]  5.363200e-29  4.788572e-30  4.107157e-31  3.383290e-32  2.676049e-33
##  [61]  2.031815e-34  1.480375e-35  1.034671e-36  6.934301e-38  4.454325e-39
##  [66]  2.741123e-40  1.615140e-41  9.106926e-43  4.910597e-44  2.530420e-45
##  [71]  1.245128e-46  5.845669e-48  2.616117e-49  1.114936e-50  4.520010e-52
##  [76]  1.741041e-53  6.363454e-55  2.203794e-56  7.220406e-58  2.234162e-59
##  [81]  6.516307e-61  1.787738e-62  4.602579e-64  1.109055e-65  2.493907e-67
##  [86]  5.216014e-69  1.010856e-70  1.807405e-72  2.966699e-74  4.444493e-76
##  [91]  6.035732e-78  7.369636e-80  8.010474e-82  7.656367e-84  6.335055e-86
##  [96]  4.445653e-88  2.572716e-90  1.178793e-92  4.009500e-95  9.000000e-98
## [101] 1.000000e-100
plot(x=n_visitors, y=prob, type="h")

n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 13
prob <- dbinom(n_visitors, 
    size = n_ads_shown, prob = proportion_clicks)
prob
##   [1]  0.000000e+00  2.965956e-11  1.004526e-07  8.009768e-06  1.368611e-04
##   [6]  1.001075e-03  4.265719e-03  1.247940e-02  2.764481e-02  4.939199e-02
##  [11]  7.430209e-02  9.703719e-02  1.125256e-01  1.178532e-01  1.129620e-01
##  [16]  1.001234e-01  8.274855e-02  6.419966e-02  4.701652e-02  3.265098e-02
##  [21]  2.158348e-02  1.362418e-02  8.234325e-03  4.775927e-03  2.663369e-03
##  [26]  1.430384e-03  7.408254e-04  3.704422e-04  1.790129e-04  8.366678e-05
##  [31]  3.784500e-05  1.657584e-05  7.032793e-06  2.891291e-06  1.151996e-06
##  [36]  4.448866e-07  1.665302e-07  6.041614e-08  2.124059e-08  7.234996e-09
##  [41]  2.386939e-09  7.624614e-10  2.357105e-10  7.048636e-11  2.037726e-11
##  [46]  5.691404e-12  1.534658e-12  3.991862e-13  1.000759e-13  2.415778e-14
##  [51]  5.609229e-15  1.251336e-15  2.678760e-16  5.495443e-17  1.078830e-17
##  [56]  2.023515e-18  3.620178e-19  6.166397e-20  9.980560e-21  1.531703e-21
##  [61]  2.223762e-22  3.046572e-23  3.927965e-24  4.752038e-25  5.377247e-26
##  [66]  5.671478e-27  5.554432e-28  5.030231e-29  4.193404e-30  3.201904e-31
##  [71]  2.227032e-32  1.402449e-33  7.942805e-35  4.015572e-36  1.797200e-37
##  [76]  7.054722e-39  2.403574e-40  7.024314e-42  1.737424e-43  3.582066e-45
##  [81]  6.048981e-47  8.199196e-49  8.713462e-51  7.062754e-53  4.226413e-55
##  [86]  1.795925e-57  5.170371e-60  9.521923e-63  1.044590e-65  6.239308e-69
##  [91]  1.807405e-72  2.180415e-76  8.911963e-81  9.240821e-86  1.591196e-91
##  [96]  2.358848e-98 1.001493e-106 1.546979e-117 8.461578e-133 6.239651e-159
## [101]  0.000000e+00
plot(x=proportion_clicks, y=prob, type="h")

n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- seq(0, 100, by = 1)
pars <- expand.grid(proportion_clicks = proportion_clicks,
                    n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, 
    size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
pars_conditioned <- pars[pars$n_visitors==6, ]
pars_conditioned$probability <- pars_conditioned$probability / sum(pars_conditioned$probability)
plot(x=pars_conditioned$proportion_clicks, y=pars_conditioned$probability, type="h")

# Simplify slightly for a known result of 6
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 6
pars <- expand.grid(proportion_clicks = proportion_clicks,
                    n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, 
    size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
plot(pars$proportion_clicks, pars$probability, type = "h")


Chapter 5 - More Parameters, Data, and Bayes

Temperature in a normal lake:

  • Example of having some water data temperature for a given day
    • temp <- c(19, 23, 20, 17, 23)
    • The normal distribution could be a good candidate for the generative function in this case - parameters with mu and sigma
    • rnorm(n = , mean = , sd = )
    • dnorm(x = , mean = , sd = )
    • like <- dnorm(x = temp, mean = 20, sd = 2) # likelihood of our observed temperatures given a mean of 20 and a standard deviation of 2
    • prod(like) # joint likelihood of probabilities
    • log(like) # addresses the problem of likelihoods being so small that computer precision becomes an issue
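
The underflow point can be seen directly - dnorm() takes a log argument, so the joint log-likelihood becomes a sum instead of a product (a sketch using the temperatures above):

```r
temp <- c(19, 23, 20, 17, 23)
like <- dnorm(x = temp, mean = 20, sd = 2)
prod(like)  # joint likelihood - already tiny; underflows to 0 with enough data
# Equivalent on the log scale, which stays numerically stable
sum(dnorm(x = temp, mean = 20, sd = 2, log = TRUE))
log(prod(like))  # matches the sum of logs while prod() is still above zero
```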

Bayesian model of water temperature:

  • Can define priors such as sigma ~ Uniform(0, 10) and mu ~ N(18, 5)
  • Can then run an additional grid-approximation exercise
    • temp <- c(19, 23, 20, 17, 23)
    • mu <- seq(8, 30, by = 0.5)
    • sigma <- seq(0.1, 10, by = 0.3)
    • pars <- expand.grid(mu = mu, sigma = sigma)
    • pars$mu_prior <- dnorm(pars$mu, mean = 18, sd = 5)
    • pars$sigma_prior <- dunif(pars$sigma, min = 0, max = 10)
    • pars$prior <- pars$mu_prior * pars$sigma_prior
    • for(i in 1:nrow(pars)) {
    • likelihoods <- dnorm(temp, pars$mu[i], pars$sigma[i])
    • pars$likelihood[i] <- prod(likelihoods)
    • }
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability)

Beach party implications of water temperatures:

  • Suppose that there is a minimum water temperature for holding a beach party and that we want the probability of exceeding this temperature
  • It is helpful to create a frame for further analysis; random sampling, weighted by probability, can help
    • sample_indices <- sample(1:nrow(pars), size = 10000, replace = TRUE, prob = pars$probability) # draw random samples proportional to each row's probability
    • pars_sample <- pars[sample_indices, c("mu", "sigma")]
    • hist(pars_sample$mu, 30)
    • pred_temp <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma)
    • hist(pred_temp, 30)
    • sum(pred_temp >= 18) / length(pred_temp)

Practical tool (BEST):

  • Models are often available off-the-shelf which can save time; one example is BEST by John Kruschke
    • BEST assumes that data come from a t-distribution (more or less a normal distribution with the degrees of freedom added)
    • BEST estimates standard deviation, mean, and degrees of freedom
    • BEST uses MCMC (Markov chain Monte Carlo)
  • Can fit the data directly using BEST
    • library(BEST)
    • iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
    • fit <- BESTmcmc(iq)
    • fit # note that nu is the degrees of freedom
    • plot(fit)

Wrap-up and next steps:

  • Bayesian inference as a technique for modeling uncertainty - data, generative model, and prior probability distributions
  • Can use sampling and grid approximation to calculate probabilities and distributions
    • Under the hood, MCMC was used, as implemented by BEST
  • Additional areas for exploration include full application of Bayesian approaches to time series, deep learning, and the like
  • Can also add more advanced computational models

Example code includes:

mu <- 3500
sigma <- 600

weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
hist(weight_distr, xlim = c(0, 6000), col = "lightgreen")

mu <- 3500
sigma <- 600

weight <- seq(0, 6000, by=100)
likelihood <- dnorm(weight, mean=mu, sd=sigma)

plot(x=weight, y=likelihood, type="h")

# The IQ of a bunch of zombies
iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
# Defining the parameter grid
pars <- expand.grid(mu = seq(0, 150, length.out = 100), 
                    sigma = seq(0.1, 50, length.out = 100))
# Defining and calculating the prior density for each parameter combination
pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
pars$prior <- pars$mu_prior * pars$sigma_prior
# Calculating the likelihood for each parameter combination
for(i in 1:nrow(pars)) {
  likelihoods <- dnorm(iq, pars$mu[i], pars$sigma[i])
  pars$likelihood[i] <- prod(likelihoods)
}
# Calculating the probability of each parameter combination
pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)
lattice::levelplot(probability ~ mu * sigma, data = pars)

head(pars)
##         mu sigma    mu_prior sigma_prior        prior likelihood
## 1 0.000000   0.1 0.002419707  0.02004008 4.849113e-05          0
## 2 1.515152   0.1 0.002456367  0.02004008 4.922578e-05          0
## 3 3.030303   0.1 0.002493009  0.02004008 4.996010e-05          0
## 4 4.545455   0.1 0.002529617  0.02004008 5.069373e-05          0
## 5 6.060606   0.1 0.002566174  0.02004008 5.142633e-05          0
## 6 7.575758   0.1 0.002602661  0.02004008 5.215754e-05          0
##   probability
## 1           0
## 2           0
## 3           0
## 4           0
## 5           0
## 6           0
sample_indices <- sample( nrow(pars), size = 10000,
    replace = TRUE, prob = pars$probability)
head(sample_indices)
## [1] 2827 2728 3025 3126 4035 4727
pars_sample <- pars[sample_indices, c("mu", "sigma")]
hist(pars_sample$mu)

quantile(pars_sample$mu, c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 34.84848 42.42424 50.00000
head(pars_sample)
##            mu    sigma
## 2827 39.39394 14.21313
## 2728 40.90909 13.70909
## 3025 36.36364 15.22121
## 3126 37.87879 15.72525
## 4035 51.51515 20.26162
## 4727 39.39394 23.78990
pred_iq <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma)
hist(pred_iq)

mean(pred_iq >= 60)
## [1] 0.0886
# The IQ of zombies on a regular diet and a brain based diet.
iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
mean(iq_brains) - mean(iq_regular)
## [1] 9.3
# Need to load http://www.sourceforge.net/projects/mcmc-jags/files for rjags (called by BEST)
# library(BEST)
# best_posterior <- BESTmcmc(iq_brains, iq_regular)
# plot(best_posterior)

Categorical Data in the Tidyverse

Chapter 1 - Introduction to Factor Variables

Introduction to qualitative variables:

  • Identifying and inspecting categorical data, using the forcats package, effective visualization
  • Qualitative data in this course will include categorical data and ordinal data
    • Each has a fixed and known set of possible values - ordinal data adds an ordering, though without defined distances, so the gaps between 1 vs. 2 and 2 vs. 3 may differ
  • Categorical data is generally best converted to factors if there is a finite set of potential values
  • Can check for factors using is.factor()

Understanding your qualitative variables:

  • Data is from the Kaggle 2017 data science survey
  • High-level summaries include category names and levels; converting to factors can be valuable
    • multipleChoiceResponses %>% mutate_if(is.character, as.factor) # nice function! Applies the mutation only to columns where the predicate (is.character) is TRUE
    • nlevels(multipleChoiceResponses$LearningDataScienceTime) # number of levels
    • levels(multipleChoiceResponses$LearningDataScienceTime) # level names
    • multipleChoiceResponses %>% summarise_if(is.factor, nlevels) # run summarize only for the factors

Making better plots:

  • Can use forcats::fct_reorder() to reorder data for plotting - fct_reorder(factor, orderingCriteria)
    • ggplot(WorkChallenges) + geom_point(aes(x = fct_reorder(question, perc_problem), y = perc_problem))
  • Can use forcats::fct_infreq() to order based on frequency and reverse the order using forcats::fct_rev() as needed
    • ggplot(multiple_choice_responses) + geom_bar(aes(x = fct_infreq(CurrentJobTitleSelect)))
    • ggplot(multiple_choice_responses) + geom_bar(aes(x = fct_rev(fct_infreq(CurrentJobTitleSelect))))
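
A minimal toy example of the ordering helpers (the job titles here are made up, not from the survey):

```r
library(forcats)
jobs <- factor(c("DBA", "Analyst", "Analyst", "Engineer", "Analyst", "Engineer"))
levels(fct_infreq(jobs))           # "Analyst" "Engineer" "DBA" - most frequent first
levels(fct_rev(fct_infreq(jobs)))  # reversed, so the least frequent comes first
```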

Example code includes:

multiple_choice_answers <- readr::read_csv("./RInputFiles/smc_with_js.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Age = col_double()
## )
## See spec(...) for full column specifications.
# Print out the dataset
glimpse(multiple_choice_answers)
## Observations: 16,716
## Variables: 47
## $ LearningPlatformUsefulnessArxiv             <chr> NA, NA, "Very usef...
## $ LearningPlatformUsefulnessBlogs             <chr> NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessCollege           <chr> NA, NA, "Somewhat ...
## $ LearningPlatformUsefulnessCompany           <chr> NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessConferences       <chr> "Very useful", NA,...
## $ LearningPlatformUsefulnessFriends           <chr> NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessKaggle            <chr> NA, "Somewhat usef...
## $ LearningPlatformUsefulnessNewsletters       <chr> NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessCommunities       <chr> NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessDocumentation     <chr> NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessCourses           <chr> NA, NA, "Very usef...
## $ LearningPlatformUsefulnessProjects          <chr> NA, NA, NA, "Very ...
## $ LearningPlatformUsefulnessPodcasts          <chr> "Very useful", NA,...
## $ LearningPlatformUsefulnessSO                <chr> NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessTextbook          <chr> NA, NA, NA, NA, "S...
## $ LearningPlatformUsefulnessTradeBook         <chr> "Somewhat useful",...
## $ LearningPlatformUsefulnessTutoring          <chr> NA, NA, NA, NA, NA...
## $ LearningPlatformUsefulnessYouTube           <chr> NA, NA, "Very usef...
## $ CurrentJobTitleSelect                       <chr> "DBA/Database Engi...
## $ MLMethodNextYearSelect                      <chr> "Random Forests", ...
## $ WorkChallengeFrequencyPolitics              <chr> "Rarely", NA, NA, ...
## $ WorkChallengeFrequencyUnusedResults         <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyUnusefulInstrumenting <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyDeployment            <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyDirtyData             <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyExplaining            <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyPass                  <chr> NA, NA, NA, NA, NA...
## $ WorkChallengeFrequencyIntegration           <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyTalent                <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyDataFunds             <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyDomainExpertise       <chr> NA, NA, NA, "Most ...
## $ WorkChallengeFrequencyML                    <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyTools                 <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyExpectations          <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyITCoordination        <chr> NA, NA, NA, NA, "S...
## $ WorkChallengeFrequencyHiringFunds           <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyPrivacy               <chr> "Often", NA, NA, "...
## $ WorkChallengeFrequencyScaling               <chr> "Most of the time"...
## $ WorkChallengeFrequencyEnvironments          <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyClarity               <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyDataAccess            <chr> NA, NA, NA, "Often...
## $ WorkChallengeFrequencyOtherSelect           <chr> NA, NA, NA, NA, NA...
## $ WorkInternalVsExternalTools                 <chr> "Do not know", NA,...
## $ FormalEducation                             <chr> "Bachelor's degree...
## $ Age                                         <dbl> NA, 30, 28, 56, 38...
## $ DataScienceIdentitySelect                   <chr> "Yes", "Yes", "Yes...
## $ JobSatisfaction                             <chr> "5", NA, NA, "10 -...
# Check if CurrentJobTitleSelect is a factor
is.factor(multiple_choice_answers$CurrentJobTitleSelect)
## [1] FALSE
# mutate() and summarise() in dplyr both have variants where you can add the suffix if, all, or at to change the operation
# mutate_if() applies a function to every column for which a predicate (the first argument) returns TRUE
# mutate_all() applies a function to all columns
# mutate_at() affects columns selected with a character vector or select helpers (e.g. mutate_at(c("height", "weight"), log))

# Change all the character columns to factors
responses_as_factors <- multiple_choice_answers %>%
    mutate_if(is.character, as.factor)

# Make a two column dataset with variable names and number of levels
number_of_levels <- responses_as_factors %>%
    summarise_all(nlevels) %>%
    gather(variable, num_levels)
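The `_at` and `_all` variants described in the comments above are not exercised by the course code; a minimal sketch on a toy tibble (hypothetical column names) would be:

```r
library(dplyr)

measurements <- tibble(height = c(150, 160), weight = c(50, 60), id = c("a", "b"))

# mutate_at(): apply log only to the columns named in the character vector
measurements %>% mutate_at(c("height", "weight"), log)

# mutate_all(): apply a function to every column, so drop the character column first
measurements %>% select(-id) %>% mutate_all(sqrt)
```

In current dplyr these variants are superseded by mutate(across(...)), but they still work.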


# dplyr has two other functions that can come in handy when exploring a dataset
# The first is top_n(n, var), which returns the n rows of a dataset with the largest values of var
# The other is pull(), which extracts a column as a bare vector, dropping the column name and leaving only the value(s)

# Select the 4 rows with the highest number of levels
number_of_levels %>%
    top_n(4, num_levels)
## # A tibble: 4 x 2
##   variable               num_levels
##   <chr>                       <int>
## 1 CurrentJobTitleSelect          16
## 2 MLMethodNextYearSelect         25
## 3 FormalEducation                 7
## 4 JobSatisfaction                11
# How many levels does CurrentJobTitleSelect have? 
number_of_levels %>%
    filter(variable=="CurrentJobTitleSelect") %>%
    pull(num_levels)
## [1] 16
# Get the names of the levels of CurrentJobTitleSelect
responses_as_factors %>%
    pull(CurrentJobTitleSelect) %>%
    levels()
##  [1] "Business Analyst"                    
##  [2] "Computer Scientist"                  
##  [3] "Data Analyst"                        
##  [4] "Data Miner"                          
##  [5] "Data Scientist"                      
##  [6] "DBA/Database Engineer"               
##  [7] "Engineer"                            
##  [8] "Machine Learning Engineer"           
##  [9] "Operations Research Practitioner"    
## [10] "Other"                               
## [11] "Predictive Modeler"                  
## [12] "Programmer"                          
## [13] "Researcher"                          
## [14] "Scientist/Researcher"                
## [15] "Software Developer/Software Engineer"
## [16] "Statistician"
# Make a bar plot
ggplot(multiple_choice_answers, aes(x=FormalEducation)) + 
    geom_bar() + 
    coord_flip()

# Make a bar plot
ggplot(multiple_choice_answers, aes(x=fct_rev(fct_infreq(FormalEducation)))) + 
    geom_bar() + 
    coord_flip()

multiple_choice_answers %>%
  filter(!is.na(Age) & !is.na(FormalEducation)) %>%
  group_by(FormalEducation) %>%
  summarize(mean_age = mean(Age)) %>%
  ggplot(aes(x = fct_reorder(FormalEducation, mean_age), y = mean_age)) + 
    geom_point() + 
    coord_flip()


Chapter 2 - Manipulating Factor Variables

Reordering factors:

  • Can order by frequency or by another variable for pure categorical variables
  • For ordinal variables, it is typically best to order by the implied order inside the ordinal variable - can enter the order manually using fct_relevel()
    • ggplot(nlp_frequency, aes(x = fct_relevel(response, "Rarely", "Sometimes", "Often", "Most of the time"))) + geom_bar()
    • nlp_frequency %>% pull(response) %>% levels()
    • nlp_frequency %>% mutate(response = fct_relevel(response, "Often", "Most of the time")) %>% pull(response) %>% levels() # moves Often and Most of the time to the front, leaving the others alone
    • nlp_frequency %>% mutate(response = fct_relevel(response, "Often", "Most of the time", after = 2)) %>% pull(response) %>% levels() # moves these to after position 2
    • nlp_frequency %>% mutate(response = fct_relevel(response, "Often", "Most of the time", after = Inf)) %>% pull(response) %>% levels() # moves these to the end
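A minimal runnable sketch of the fct_relevel() behaviors above (toy factor, not the course data):

```r
library(forcats)

resp <- factor(c("Often", "Rarely", "Sometimes", "Most of the time"))
levels(resp)  # alphabetical by default

# Impose the full ordinal ordering manually
levels(fct_relevel(resp, "Rarely", "Sometimes", "Often", "Most of the time"))

# Move selected levels after position 2, or to the end with after = Inf
levels(fct_relevel(resp, "Rarely", after = 2))
levels(fct_relevel(resp, "Rarely", after = Inf))
```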

Renaming factor levels:

  • Can rename factor levels using forcats::fct_recode()
    • levels(flying_etiquette$middle_arm_rest_three) # get the initial levels
    • ggplot(flying_etiquette, aes(x = fct_infreq(middle_arm_rest_three))) + geom_bar() + coord_flip() + labs(x = "Arm rest opinions") # labels are very wordy, so the graph is compressed
    • flying_etiquette %>% mutate(middle_arm_rest_three = fct_recode(middle_arm_rest_three,
    •     "Other" = "Other (please specify)", "Everyone should share" = "The arm rests should be shared",
    •     "Aisle and window people" = "The people in the aisle and window seats get both arm rests",
    •     "Middle person" = "The person in the middle seat gets both arm rests",
    •     "Fastest person" = "Whoever puts their arm on the arm rest first")
    • ) %>%
    •     count(middle_arm_rest_three)

Collapsing factor levels:

  • Can collapse factor levels using forcats::fct_collapse()
    • flying_etiquette %>% mutate(height = fct_collapse(height, under_5_3 = c("Under 5 ft.", "5'0\"", "5'1\"", "5'2\""), over_6_1 = c("6'1\"", "6'2\"", "6'3\"", "6'4\"", "6'5\"", "6'6\" and above"))) %>% pull(height) %>% levels()
  • Can collapse factor levels to "Other" using forcats::fct_other() or forcats::fct_lump()
    • flying_etiquette %>% mutate(new_height = fct_other(height, keep = c("6'4\"", "5'1\""))) %>% count(new_height) # everything other than the keep levels becomes Other
    • flying_etiquette %>% mutate(new_height = fct_other(height, drop = c("Under 5 ft.", "5'0\"", "5'1\"", "5'2\"", "5'3\""))) %>% pull(new_height) %>% levels() # moves the drop levels to Other
    • flying_etiquette %>% mutate(new_height = fct_lump(height, prop = .08)) %>% count(new_height) # any level with a proportion below 0.08 is lumped into Other
    • flying_etiquette %>% mutate(new_height = fct_lump(height, n = 3)) %>% count(new_height) # keep the top 3 categories, lumping the rest
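The collapsing and lumping functions above can be sketched on a toy factor (hypothetical job titles, not the survey data):

```r
library(forcats)

titles <- factor(c("Programmer", "Data Analyst", "Data Scientist",
                   "Statistician", "Programmer", "Data Analyst"))

# fct_collapse(): merge several levels into one new level
fct_collapse(titles, Analyst = c("Data Analyst", "Data Scientist"))

# fct_other(): keep only the named levels; everything else becomes "Other"
fct_other(titles, keep = "Programmer")

# fct_lump(): lump the least frequent levels into "Other"
fct_lump(titles, n = 2)
```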

Example code includes:

multiple_choice_responses <- multiple_choice_answers

# Print the levels of WorkInternalVsExternalTools
levels(multiple_choice_responses$WorkInternalVsExternalTools)
## NULL
# Reorder the levels from internal to external 
mc_responses_reordered <- multiple_choice_responses %>%
    mutate(WorkInternalVsExternalTools = fct_relevel(WorkInternalVsExternalTools, 
                                                     c('Entirely internal', 'More internal than external',
                                                       'Approximately half internal and half external', 
                                                       'More external than internal', 'Entirely external',
                                                       'Do not know'
                                                       )
                                                     )
           )

# Make a bar plot of the responses
ggplot(mc_responses_reordered, aes(x=WorkInternalVsExternalTools)) + 
    geom_bar() + 
    coord_flip()

multiple_choice_responses %>%
    # Move "I did not complete any formal education past high school" and "Some college/university study without earning a bachelor's degree" to the front
    mutate(FormalEducation = fct_relevel(FormalEducation, c("I did not complete any formal education past high school", "Some college/university study without earning a bachelor's degree"))) %>%
    # Move "Doctoral degree" to be the sixth level
    mutate(FormalEducation = fct_relevel(FormalEducation, after=6, "Doctoral degree")) %>%
    # Move "I prefer not to answer" to be the last level.
    mutate(FormalEducation = fct_relevel(FormalEducation, after=Inf, "I prefer not to answer")) %>%
    # Examine the new level order
    pull(FormalEducation) %>%
    levels()
## [1] "I did not complete any formal education past high school"         
## [2] "Some college/university study without earning a bachelor's degree"
## [3] "Bachelor's degree"                                                
## [4] "Master's degree"                                                  
## [5] "Professional degree"                                              
## [6] "Doctoral degree"                                                  
## [7] "I prefer not to answer"
# make a bar plot of the frequency of FormalEducation
ggplot(multiple_choice_responses, aes(x=FormalEducation)) + 
    geom_bar()

multiple_choice_responses %>%
    # rename levels
    mutate(FormalEducation = fct_recode(FormalEducation, "High school" ="I did not complete any formal education past high school", "Some college" = "Some college/university study without earning a bachelor's degree")) %>%
    # make a bar plot of FormalEducation
    ggplot(aes(x=FormalEducation)) + 
    geom_bar()

multiple_choice_responses %>%
    # Create new variable, grouped_titles, by collapsing levels in CurrentJobTitleSelect
    mutate(grouped_titles = fct_collapse(CurrentJobTitleSelect, 
        "Computer Scientist" = c("Programmer", "Software Developer/Software Engineer"), 
        "Researcher" = "Scientist/Researcher", 
        "Data Analyst/Scientist/Engineer" = c("DBA/Database Engineer", "Data Scientist", 
                                              "Business Analyst", "Data Analyst", 
                                              "Data Miner", "Predictive Modeler"))) %>%
    # Turn every title that isn't now one of the grouped_titles into "Other"
    mutate(grouped_titles = fct_other(grouped_titles, 
                             keep = c("Computer Scientist", 
                                     "Researcher", 
                                     "Data Analyst/Scientist/Engineer"))) %>% 
    # Get a count of the grouped titles
    count(grouped_titles)
## Warning: Factor `grouped_titles` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 5 x 2
##   grouped_titles                      n
##   <fct>                           <int>
## 1 Data Analyst/Scientist/Engineer  4928
## 2 Computer Scientist               2556
## 3 Researcher                       1597
## 4 Other                            2749
## 5 <NA>                             4886
multiple_choice_responses %>%
  # remove NAs of MLMethodNextYearSelect
  filter(!is.na(MLMethodNextYearSelect)) %>%
  # create ml_method, which lumps all those with less than 5% of people into "Other"
  mutate(ml_method = fct_lump(MLMethodNextYearSelect, prop=0.05)) %>%
  # print the frequency of your new variable in descending order
  count(ml_method, sort=TRUE)
## # A tibble: 4 x 2
##   ml_method                n
##   <fct>                <int>
## 1 Other                 4405
## 2 Deep learning         4362
## 3 Neural Nets           1386
## 4 Time Series Analysis   680
multiple_choice_responses %>%
  # remove NAs 
  filter(!is.na(MLMethodNextYearSelect)) %>%
  # create ml_method, retaining the 5 most common methods and renaming others "other method" 
  mutate(ml_method = fct_lump(MLMethodNextYearSelect, 5, other_level="other method")) %>%
  # print the frequency of your new variable in descending order
  count(ml_method, sort=TRUE)
## # A tibble: 6 x 2
##   ml_method                n
##   <fct>                <int>
## 1 Deep learning         4362
## 2 other method          3401
## 3 Neural Nets           1386
## 4 Time Series Analysis   680
## 5 Bayesian Methods       511
## 6 Text Mining            493

Chapter 3 - Creating Factor Variables

Examining common themed variables:

  • Tidy data has each row as an observation and each column as a variable (generally, moving from wide to long)
    • multipleChoiceResponses %>% select(contains("WorkChallengeFrequency")) %>% gather(work_challenge, frequency)
    • work_challenges <- multipleChoiceResponses %>% select(contains("WorkChallengeFrequency")) %>% gather(work_challenge, frequency) %>%
    •     mutate(work_challenge = str_remove(work_challenge, "WorkChallengeFrequency")) # removes the string "WorkChallengeFrequency" from the work_challenge column
  • Can also convert the variables to 0/1 and then use for statistical summaries
    • work_challenges %>% filter(!is.na(frequency)) %>% mutate(frequency = if_else(frequency %in% c("Most of the time", "Often"), 1, 0)) %>%
    •     group_by(work_challenge) %>% summarise(perc_problem = mean(frequency))

Tricks of ggplot2:

  • Initial plots may not look so good, for example
    • ggplot(job_titles_by_perc, aes(x = CurrentJobTitleSelect, y = perc_w_title)) + geom_point()
  • Can angle the tick axes for better readability
    • ggplot(job_titles_by_perc, aes(x = CurrentJobTitleSelect, y = perc_w_title)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
  • Can reorder by order of popularity using fct_reorder()
    • ggplot(job_titles_by_perc, aes(x = fct_reorder(CurrentJobTitleSelect, perc_w_title), y = perc_w_title)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
    • ggplot(job_titles_by_perc, aes(x = fct_rev(fct_reorder(CurrentJobTitleSelect, perc_w_title)), y = perc_w_title)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
  • Can add axis labels and titles using labs()
    • g <- ggplot(job_titles_by_perc, aes(x = fct_rev(fct_reorder(CurrentJobTitleSelect, perc_w_title)), y = perc_w_title)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
    • g <- g + labs(x = "Job Title", y = "Percent with title")
  • Can change the y-axis ticks to be explicit percentages
    • g + scale_y_continuous(labels = scales::percent_format())

Changing and creating variables with dplyr::case_when():

  • Suppose that there is a simple vector to be recoded using simple business rules
    • x <- 1:20
    • case_when(x %% 15 == 0 ~ "fizz buzz", x %% 3 == 0 ~ "fizz", x %% 5 == 0 ~ "buzz", TRUE ~ as.character(x)) # TRUE catches all remaining cases
  • Conditions evaluate from first to last, and the first matching condition is acted on for that value
    • moods %>% mutate(action = case_when(mood == "happy" & status == "know it" ~ "clap your hands", mood == "happy" & status == "do not know it" ~ "stomp your feet", mood == "sad" ~ "look at puppies", TRUE ~ "jump around"))

Example code includes:

learning_platform_usefulness <- multiple_choice_responses %>%
  # select columns with LearningPlatformUsefulness in title
  select(contains("LearningPlatformUsefulness")) %>%
  # change data from wide to long
  gather(learning_platform, usefulness) %>%
  # remove rows where usefulness is NA
  filter(!is.na(usefulness)) %>%
  # remove "LearningPlatformUsefulness" from each string in `learning_platform`
  mutate(learning_platform = str_remove(learning_platform, "LearningPlatformUsefulness"))


learning_platform_usefulness %>%
  # change dataset to one row per learning_platform usefulness pair with number of entries for each
  count(learning_platform, usefulness) %>%
  # use add_count to create column with total number of answers for that learning_platform
  add_count(learning_platform, wt=n, name="nn") %>%
  # create a line graph for each question with usefulness on x-axis and percentage of responses on y
  ggplot(aes(x = usefulness, y = n/nn, group = learning_platform)) + 
  geom_line() + 
  facet_wrap(~ learning_platform)

avg_usefulness <- learning_platform_usefulness %>%
    # If usefulness is "Not Useful", make 0, else 1 
    mutate(usefulness = ifelse(usefulness=="Not Useful", 0, 1)) %>%
    # Get the average usefulness by learning platform 
    group_by(learning_platform) %>%
    summarize(avg_usefulness = mean(usefulness))

# Make a scatter plot of average usefulness by learning platform 
ggplot(avg_usefulness, aes(x=learning_platform, y=avg_usefulness)) + 
    geom_point()

ggplot(avg_usefulness, aes(x = learning_platform, y = avg_usefulness)) + 
    geom_point() + 
    # rotate x-axis text by 90 degrees
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    # rename y and x axis labels
    labs(x="Learning Platform", y="Percent finding at least somewhat useful") + 
    # change y axis scale to percentage
    scale_y_continuous(labels = scales::percent)

ggplot(avg_usefulness, 
       aes(x = fct_rev(fct_reorder(learning_platform, avg_usefulness)), y = avg_usefulness)
       ) + 
    geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(x = "Learning Platform", y = "Percent finding at least somewhat useful") + 
    scale_y_continuous(labels = scales::percent)

# Check the min age
min(multiple_choice_responses$Age, na.rm=TRUE)
## [1] 0
# Check the max age
max(multiple_choice_responses$Age, na.rm=TRUE)
## [1] 100
sum(is.na(multiple_choice_responses$Age))
## [1] 331
multiple_choice_responses %>%
    # Eliminate any ages below 10 and above 90
    filter(between(Age, 10, 90)) %>%
    # Create the generation variable based on age
    mutate(generation=case_when(
      between(Age, 10, 22) ~ "Gen Z", 
      between(Age, 23, 37) ~ "Gen Y", 
      between(Age, 38, 52) ~ "Gen X", 
      between(Age, 53, 71) ~ "Baby Boomer", 
      between(Age, 72, 90) ~ "Silent"
    )) %>%
    # Get a count of how many answers in each generation
    count(generation)
## # A tibble: 5 x 2
##   generation      n
##   <chr>       <int>
## 1 Baby Boomer   832
## 2 Gen X        3162
## 3 Gen Y       10281
## 4 Gen Z        2037
## 5 Silent         37
multiple_choice_responses %>%
    # Filter out people who selected Data Scientist as their Job Title
    filter(!is.na(CurrentJobTitleSelect) & CurrentJobTitleSelect != "Data Scientist")  %>%
    # Create a new variable, job_identity
    mutate(job_identity = case_when(
        CurrentJobTitleSelect == "Data Analyst" & DataScienceIdentitySelect == "Yes" ~ "DS analysts", 
        CurrentJobTitleSelect == "Data Analyst" & DataScienceIdentitySelect %in% c("No", "Sort of (Explain more)") ~ "NDS analyst", 
        CurrentJobTitleSelect != "Data Analyst" & DataScienceIdentitySelect == "Yes" ~ "DS non-analysts", 
        TRUE ~ "NDS non analysts")
        ) %>%
    mutate(JobSat=case_when(
        is.na(JobSatisfaction) ~ NA_integer_,
        JobSatisfaction == "I prefer not to share" | JobSatisfaction == "NA" ~ NA_integer_, 
        JobSatisfaction == "1 - Highly Dissatisfied" ~ 1L, 
        JobSatisfaction == "10 - Highly Satisfied" ~ 10L, 
        TRUE ~ as.integer(JobSatisfaction))) %>%
    # Get the average job satisfaction by job_identity, removing NAs
    group_by(job_identity) %>%
    summarize(avg_js = mean(JobSat, na.rm=TRUE))
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by
## coercion
## # A tibble: 4 x 2
##   job_identity     avg_js
##   <chr>             <dbl>
## 1 DS analysts        6.44
## 2 DS non-analysts    6.93
## 3 NDS analyst        6.14
## 4 NDS non analysts   6.43

Chapter 4 - Case Study on Flight Etiquette

Case study introduction:

  • Recreation of a FiveThirtyEight (538) plot based on its flying etiquette survey dataset
  • Need to begin by converting variable types, selecting key columns, and tidying the data
    • wide_data %>% mutate_if(is.character, as.factor)
    • wide_data %>% gather(column, value)
    • wide_data %>% select(contains("favorite"))

Data preparation and regex:

  • Names will need to be changed to something more succinct for plotting
    • gathered_data %>% distinct(response_var)
  • Regular expressions can be used in any computing language to find instances of general patterns
    • str_detect("happy", ".") # the . will match any single character
    • str_detect("happy", "h.") # TRUE
    • str_detect("happy", "y.") # FALSE, since nothing follows the y
    • str_remove(string, ".*the ") # removes everything up to and including "the " (with its trailing space)
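These regex behaviors can be verified directly in stringr (a sketch; the string in the last call is hypothetical):

```r
library(stringr)

str_detect("happy", "h.")  # TRUE: "h" followed by any character
str_detect("happy", "y.")  # FALSE: nothing follows the final "y"

# Greedy .* consumes everything through the last "the " before removing it
str_remove("who should get the window seat", ".*the ")  # "window seat"
```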

Recreating the plot:

  • The labs() command allows for both a caption and a subtitle
    • ggplot(mtcars, aes(disp, mpg)) + geom_point() + labs(x = "x axis label", y = "y axis label", title = "My title", subtitle = "and a subtitle", caption = "even a caption!")
  • The geom_text() layer allows for adding the specific numbers to the plot
    • initial_plot + geom_text(aes(label = round(mean_mpg)))
    • initial_plot + geom_text(aes(label = round(mean_mpg), y = mean_mpg + 2)) # fix the issue where the text is on top of the tip of the bar
  • Can use the theme layer to modify the non-data layers of the plot
    • initial_plot + geom_text(aes(label = round(mean_mpg), y = mean_mpg + 2)) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # get rid of x and y ticks

End of course recap:

  • Basic forcats functions
  • Tidyverse functions
  • ggplot2 tricks
  • Recreating the 538 plot for airplane rudeness behaviors

Example code includes:

flying_etiquette <- read.csv("./RInputFiles/flying-etiquette.csv", stringsAsFactors = FALSE)
names(flying_etiquette) <- 
    stringr::str_replace_all(stringr::str_replace_all(names(flying_etiquette), "\\.", " "), "  ", " ")
names(flying_etiquette) <- stringr::str_trim(names(flying_etiquette))
names(flying_etiquette)[2:22] <- paste0(names(flying_etiquette)[2:22], "?")
names(flying_etiquette) <- stringr::str_replace_all(names(flying_etiquette), "itrude", "it rude")
glimpse(flying_etiquette)
## Observations: 1,040
## Variables: 27
## $ RespondentID                                                                                                                              <dbl> ...
## $ `How often do you travel by plane?`                                                                                                       <chr> ...
## $ `Do you ever recline your seat when you fly?`                                                                                             <chr> ...
## $ `How tall are you?`                                                                                                                       <chr> ...
## $ `Do you have any children under 18?`                                                                                                      <chr> ...
## $ `In a row of three seats who should get to use the two arm rests?`                                                                        <chr> ...
## $ `In a row of two seats who should get to use the middle arm rest?`                                                                        <chr> ...
## $ `Who should have control over the window shade?`                                                                                          <chr> ...
## $ `Is it rude to move to an unsold seat on a plane?`                                                                                        <chr> ...
## $ `Generally speaking is it rude to say more than a few words tothe stranger sitting next to you on a plane?`                               <chr> ...
## $ `On a 6 hour flight from NYC to LA how many times is it acceptable to get up if you re not in an aisle seat?`                             <chr> ...
## $ `Under normal circumstances does a person who reclines their seat during a flight have any obligation to the person sitting behind them?` <chr> ...
## $ `Is it rude to recline your seat on a plane?`                                                                                             <chr> ...
## $ `Given the opportunity would you eliminate the possibility of reclining seats on planes entirely?`                                        <chr> ...
## $ `Is it rude to ask someone to switch seats with you in order to be closer to friends?`                                                    <chr> ...
## $ `Is it rude to ask someone to switch seats with you in order to be closer to family?`                                                     <chr> ...
## $ `Is it rude to wake a passenger up if you are trying to go to the bathroom?`                                                              <chr> ...
## $ `Is it rude to wake a passenger up if you are trying to walk around?`                                                                     <chr> ...
## $ `In general is it rude to bring a baby on a plane?`                                                                                       <chr> ...
## $ `In general is it rude to knowingly bring unruly children on a plane?`                                                                    <chr> ...
## $ `Have you ever used personal electronics during take off or landing in violation of a flight attendant s direction?`                      <chr> ...
## $ `Have you ever smoked a cigarette in an airplane bathroom when it was against the rules?`                                                 <chr> ...
## $ Gender                                                                                                                                    <chr> ...
## $ Age                                                                                                                                       <chr> ...
## $ `Household Income`                                                                                                                        <chr> ...
## $ Education                                                                                                                                 <chr> ...
## $ `Location Census Region`                                                                                                                  <chr> ...
gathered_data <- flying_etiquette %>%
    mutate_if(is.character, as.factor) %>%
    filter(`How often do you travel by plane?` != "Never") %>%
    # Select columns containing "rude"
    select(contains("rude")) %>%
    # Change format from wide to long
    gather(response_var, value)
## Warning: attributes are not identical across measure variables;
## they will be dropped
rude_behaviors <- gathered_data %>%
    mutate(response_var = str_replace(response_var, '.*rude to ', '')) %>%
    mutate(response_var = str_replace(response_var, 'on a plane', '')) %>%
    mutate(rude = if_else(value %in% c("No, not rude at all", "No, not at all rude"), 0, 1)) %>%
    # Create perc_rude, the percent considering each behavior rude
    group_by(response_var) %>%
    summarize(perc_rude=mean(rude))

rude_behaviors
## # A tibble: 9 x 2
##   response_var                                                    perc_rude
##   <chr>                                                               <dbl>
## 1 ask someone to switch seats with you in order to be closer to ~     0.193
## 2 ask someone to switch seats with you in order to be closer to ~     0.278
## 3 bring a baby ?                                                      0.323
## 4 knowingly bring unruly children ?                                   0.832
## 5 move to an unsold seat ?                                            0.211
## 6 recline your seat ?                                                 0.426
## 7 say more than a few words tothe stranger sitting next to you ?      0.228
## 8 wake a passenger up if you are trying to go to the bathroom?        0.388
## 9 wake a passenger up if you are trying to walk around?               0.741
# Create an ordered by plot of behavior by percentage considering it rude
initial_plot <- ggplot(rude_behaviors, aes(x = fct_reorder(response_var, perc_rude), y = perc_rude)) +
    geom_col()

# View your plot
initial_plot

titled_plot <- initial_plot + 
    # Add the title, subtitle, and caption
    labs(title = "Hell Is Other People In A Pressurized Metal Tube",
         subtitle = "Percentage of 874 air-passenger respondents who said action is very or somewhat rude",
         caption = "Source: SurveyMonkey Audience", 
         # Remove the x- and y-axis labels
         x="",
         y=""
         ) 

titled_plot

flipped_plot <- titled_plot + 
    # Flip the axes
    coord_flip() + 
    # Remove the x-axis ticks and labels
    theme(axis.ticks.x = element_blank(), 
        axis.text.x = element_blank())

flipped_plot + 
    # Add labels above the bar with the perc value
    geom_text(aes(label = paste0(round(100*perc_rude), "%"), y = perc_rude + .03), 
              position = position_dodge(0.9), vjust = 1)


Bayesian Modeling with RJAGS

Chapter 1 - Introduction to Bayesian Modeling

Prior model:

  • Goals include foundational Bayesian models such as Beta-Binomial, Normal-Normal, and Bayesian regression
    • Define, compile, simulate using RJAGS
    • Conduct Bayesian posterior inference using RJAGS
  • Example of election poll - there is some uncertainty around the polling figures that have been released
    • An additional poll might tend to update the prior model that was built using the previous elections data
    • Bayesian posterior models are a powerful means of combining priors and data
  • The prior model depends on some specific assumptions and notations
    • Suppose that p is the percentage of people who support you - assumed to be a random variable between 0 and 1
    • The prior model for p is p ~ Beta(45, 55)
    • The beta model can be tuned from pessimism Beta(1, 5) to complete uncertainty Beta(1, 1)

Data and likelihood:

  • Suppose that a candidate running for election does a small poll and finds 6 of 10 plan to vote for them
    • Can integrate even a small data sample like this with assumptions and priors
    • Assumptions might include that voters are independent and p is a global probability that any given voter supports you
    • Then X ~ Bin(n, p), which is to say that X is the number of the n polled voters who support you (given the global support probability p above)
  • There is a dependence between X, n, and p, thus the data can help assess how likely each value of p is given that you observed X supporters among the n polled
    • The likelihood function is the likelihood of observing X given that p takes on a specific value
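The likelihood function for the poll can be evaluated directly with dbinom() (a sketch, using the X = 6 of n = 10 result from the example above):

```r
# Likelihood of observing X = 6 supporters among n = 10 polled, as a function of p
p_grid <- seq(0, 1, length.out = 1000)
likelihood <- dbinom(6, size = 10, prob = p_grid)

# The likelihood is maximized near the observed proportion, p = 0.6
p_grid[which.max(likelihood)]
```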

Posterior model:

  • The prior and the likelihood can be integrated to form the posterior
    • The prior is knowledge that exists before data and the likelihood is what exists based on the data
    • Multiply prior and likelihood and then scale so probabilities add to 1
  • The RJAGS package combines R with JAGS (Just Another Gibbs Sampler) - requires downloading JAGS and then loading rjags
  • Can define the model within RJAGS, for example
    • vote_model <- "model{
    •     # Likelihood model for X
    •     X ~ dbin(p, n) # note: JAGS's dbin() takes (p, n), the reverse of the argument order in R's dbinom()
    •     # Prior model for p
    •     p ~ dbeta(a, b)
    • }"
    • vote_jags <- jags.model(textConnection(vote_model), data = list(a = 45, b = 55, X = 6, n = 10), inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))
    • vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000) # variable.names names the variable of interest, p in this case
    • plot(vote_sim, trace = FALSE)
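Because the Beta prior is conjugate to the binomial likelihood, the posterior here can be checked without JAGS by a simple grid approximation (a sketch, not part of the course code):

```r
p_grid <- seq(0, 1, length.out = 1000)
prior <- dbeta(p_grid, 45, 55)
likelihood <- dbinom(6, size = 10, prob = p_grid)

# Multiply prior and likelihood, then rescale so the probabilities sum to 1
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)

plot(p_grid, posterior, type = "l")
```

The result should match the closed-form conjugate posterior, Beta(45 + 6, 55 + 4) = Beta(51, 59).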

Example code includes:

# Make sure you have installed JAGS-4.x.y.exe (for any x >=0, y>=0) from http://www.sourceforge.net/projects/mcmc-jags/files

# Sample 10000 draws from Beta(45,55) prior
prior_A <- rbeta(n = 10000, shape1 = 45, shape2 = 55)

# Store the results in a data frame
prior_sim <- data.frame(prior_A)

# Construct a density plot of the prior sample
ggplot(prior_sim, aes(x = prior_A)) + 
    geom_density()    


# Sample 10000 draws from the Beta(1,1) prior
prior_B <- rbeta(n = 10000, shape1 = 1, shape2 = 1)    

# Sample 10000 draws from the Beta(100,100) prior
prior_C <- rbeta(n = 10000, shape1 = 100, shape2 = 100)

# Combine the results in a single data frame
prior_sim <- data.frame(samples = c(prior_A, prior_B, prior_C),
        priors = rep(c("A","B","C"), each = 10000))

# Plot the 3 priors
ggplot(prior_sim, aes(x = samples, fill = priors)) + 
    geom_density(alpha = 0.5)


# Define a vector of 1000 p values    
p_grid <- seq(0, 1, length.out=1000)

# Simulate 1 poll result for each p in p_grid   
poll_result <- rbinom(1000, 10, prob=p_grid)

# Create likelihood_sim data frame
likelihood_sim <- data.frame(p_grid, poll_result)    

# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result)) + 
    ggridges::geom_density_ridges()


# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result, fill = poll_result==6)) + 
    ggridges::geom_density_ridges()

# Keep the polls with X = 6    
likelihood_sim_6 <- likelihood_sim %>%     
    filter(poll_result==6)    

# Construct a density plot of the remaining p_grid values
ggplot(likelihood_sim_6, aes(x = p_grid)) + 
    geom_density() + 
    lims(x = c(0,1))
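
The multiply-then-rescale recipe described in the bullets above can also be carried out directly on a grid in base R, without JAGS. A minimal sketch for the Beta(45, 55) prior and the X = 6, n = 10 poll (by conjugacy the exact posterior is Beta(51, 59), so the grid answer can be checked):

```r
# Grid approximation of the posterior: multiply prior by likelihood,
# then rescale so the probabilities add to 1
p_grid     <- seq(0, 1, length.out = 1000)
prior      <- dbeta(p_grid, 45, 55)                # Beta(45, 55) prior
likelihood <- dbinom(6, size = 10, prob = p_grid)  # X = 6 successes in n = 10
posterior  <- prior * likelihood
posterior  <- posterior / sum(posterior)           # normalize to sum to 1

# Grid estimate of the posterior mean of p (exact value is 51/110)
sum(p_grid * posterior)
```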


# DEFINE the model
vote_model <- "model{
    # Likelihood model for X
    X ~ dbin(p, n)
    
    # Prior model for p
    p ~ dbeta(a, b)
}"

# COMPILE the model    
vote_jags <- jags.model(textConnection(vote_model), 
    data = list(a = 45, b = 55, X = 6, n = 10),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))

# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)

# PLOT the posterior
plot(vote_sim, trace = FALSE)


# COMPILE the model    
vote_jags <- jags.model(textConnection(vote_model), 
    data = list(a = 1, b = 1, X = 6, n = 10),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))

# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)

# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))


# COMPILE the model    
vote_jags <- jags.model(textConnection(vote_model), 
    data = list(a = 1, b = 1, X = 220, n = 400),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))

# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)

# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))


# COMPILE the model    
vote_jags <- jags.model(textConnection(vote_model), 
    data = list(a = 45, b = 55, X = 220, n = 400),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))

# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)

# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))

Chapter 2 - Bayesian Models and Markov Chains

Normal-Normal Model:

  • Example of reaction times in a sleep deprivation study
    • Y(i) ~ N(m, s**2), meaning the change in reaction time for subject i, assumed to be normally distributed as N(m, s**2)
    • Prior information is that normal reaction time is 250 and is expected to increase by 0-150 after sleep deprivation (scale of the prior)
    • The prior for the mean might then be defined as 50 ms increase in reaction time with a standard deviation of 25
    • The prior for the standard deviation might then be uniform on 0-200
  • Overall formulation of the sleep study model includes
    • Y(i) ~ N(m, s**2)
    • m ~ N(50, 25**2)
    • s ~ Unif(0, 200)
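
Before fitting, these priors can be sanity-checked by simulating from the prior predictive distribution - draw (m, s) pairs from the priors, then draw a plausible reaction-time change for each pair. A minimal base-R sketch using the N(50, 25**2) and Unif(0, 200) priors above:

```r
# Prior predictive check for the sleep model: draw (m, s) from the priors,
# then draw one reaction-time change Y for each pair
set.seed(1989)
m_draws <- rnorm(10000, mean = 50, sd = 25)   # m ~ N(50, 25^2)
s_draws <- runif(10000, min = 0, max = 200)   # s ~ Unif(0, 200)
y_prior_pred <- rnorm(10000, mean = m_draws, sd = s_draws)

# The bulk of prior-predictive changes should fall in a plausible range
quantile(y_prior_pred, c(0.025, 0.975))
```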

Simulating Normal-Normal in RJAGS:

  • Posterior insights are based on the product of the prior and the data, which can be simulated using RJAGS
    • sleep_model <- "model{
    • # Likelihood model for Y[i]
    • for(i in 1:length(Y)) {
    •   Y[i] ~ dnorm(m, s^(-2))  # requires precision which can be defined as the inverse of sigma-squared  
    • }
    • # Prior models for m and s
    • m ~ dnorm(50, 25^(-2)) # requires precision which can be defined as the inverse of sigma-squared
    • s ~ dunif(0, 200)
    • }"
    • sleep_jags <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3), inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989))
    • sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)

Markov chains:

  • The RJAGS approach approximates parameters using Markov chain Monte Carlo (MCMC)
    • The goal of RJAGS is to use Markov chains to estimate (approximate) posteriors that would otherwise be too complicated to model
  • Each iteration of a Markov chain depends on the previous iteration (the iterations are not entirely random or independent)
    • Over time, the Markov chain explores the sample space, but over short stretches the exploration covers a smaller range (there are auto-correlations)
    • The overall distribution of the Markov chain mimics a random sampling drawn from the posterior distribution

Markov chain diagnostics and reproducibility:

  • The trace plots indicate the longitudinal behavior of the chain while the density plots indicate the distribution of the chain
  • Questions about what makes for a good chain (convergence, trials needed, etc.)
    • Stability is good - long-run trends should be stabilized
    • Multiple parallel chains should return very similar results with similar features (should not be overly dependent on RNG/seed)
    • summary(sleep_sim) # provides key Markov chain diagnostics
  • Generally, problems with Markov chains can be addressed with longer chains (more iterations)
  • Helpful to set the seed and RNG in the model for reproducibility
    • inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989) # example inside jags.model()
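
Beyond trace and density plots, the coda package (the format RJAGS output is built on) provides numeric chain diagnostics. A sketch on a toy chain - with real RJAGS output, the same calls apply directly to the mcmc.list returned by coda.samples():

```r
library(coda)

# Toy chains standing in for RJAGS output; coda.samples() returns an
# mcmc.list that these same functions accept directly
set.seed(1)
chain <- mcmc(matrix(rnorm(20000), ncol = 2,
                     dimnames = list(NULL, c("m", "s"))))

effectiveSize(chain)   # effective sample size after accounting for autocorrelation
autocorr.diag(chain)   # autocorrelation at increasing lags - should decay quickly
```

With multiple parallel chains (n.chains > 1 in jags.model), coda's gelman.diag() reports the Gelman-Rubin statistic, which should be close to 1 for each parameter.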

Example code includes:

# Take 10000 samples from the m prior
prior_m <- rnorm(10000, 50, 25)

# Take 10000 samples from the s prior    
prior_s <- runif(10000, 0, 200)

# Store samples in a data frame
samples <- data.frame(prior_m, prior_s)

# Density plots of the prior_m & prior_s samples    
ggplot(samples, aes(x = prior_m)) + 
    geom_density()
ggplot(samples, aes(x = prior_s)) + 
    geom_density()


# Check out the first 6 rows of sleep_study
head(sleep_study)

# Define diff_3
sleep_study <- sleep_study %>%
  mutate(diff_3=day_3-day_0)

# Histogram of diff_3    
ggplot(sleep_study, aes(x = diff_3)) + 
    geom_histogram(binwidth = 20, color = "white")

# Mean and standard deviation of diff_3    
sleep_study %>%
  summarize(mean(diff_3), sd(diff_3))


# DEFINE the model    
sleep_model <- "model{
    # Likelihood model for Y[i]
    for(i in 1:length(Y)) {
        Y[i] ~ dnorm(m, s^(-2))
    }

    # Prior models for m and s
    m ~ dnorm(50, 25^(-2))
    s ~ dunif(0, 200)
}"    

# COMPILE the model
sleep_jags <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989))    

# SIMULATE the posterior    
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)

# PLOT the posterior    
plot(sleep_sim, trace = FALSE)    


# Let m be the average change in reaction time after 3 days of sleep deprivation
# In a previous exercise, you obtained an approximate sample of 10,000 draws from the posterior model of m
# You stored the resulting mcmc.list object as sleep_sim which is loaded in your workspace:
# In fact, the sample of m values in sleep_sim is a dependent Markov chain, the distribution of which converges to the posterior
# You will examine the contents of sleep_sim and, to have finer control over your analysis, store the contents in a data frame

# Check out the head of sleep_sim
head(sleep_sim)

# Store the chains in a data frame
sleep_chains <- data.frame(sleep_sim[[1]], iter = 1:10000)

# Check out the head of sleep_chains
head(sleep_chains)


# NOTE: The 10,000 recorded iterations start after a "burn-in" period in which samples are discarded
# Thus the iteration count doesn't start at 1!

# Use plot() to construct trace plots of the m and s chains
plot(sleep_sim, density=FALSE)

# Use ggplot() to construct a trace plot of the m chain
ggplot(sleep_chains, aes(x = iter, y = m)) + 
    geom_line()

# Trace plot the first 100 iterations of the m chain
ggplot(dplyr::filter(sleep_chains, iter<=100), aes(x = iter, y = m)) + geom_line()

# Note that the longitudinal behavior of the chain appears quite random and that the trend remains relatively constant
# This is a good thing - it indicates that the Markov chain (likely) converges quickly to the posterior distribution of m


# Use plot() to construct density plots of the m and s chains
plot(sleep_sim, trace=FALSE)

# Use ggplot() to construct a density plot of the m chain
ggplot(sleep_chains, aes(x = m)) + 
    geom_density()

# Density plot of the first 100 values in the m chain
ggplot(dplyr::filter(sleep_chains, iter<=100), aes(x = m)) + 
    geom_density()


# COMPILE the model
sleep_jags_multi <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3), n.chains=4)   

# SIMULATE the posterior    
sleep_sim_multi <- coda.samples(model = sleep_jags_multi, variable.names = c("m", "s"), n.iter = 1000)

# Check out the head of sleep_sim_multi
head(sleep_sim_multi)

# Construct trace plots of the m and s chains
plot(sleep_sim_multi, density=FALSE)


# The mean of the m Markov chain provides an estimate of the posterior mean of m
# The naive standard error provides a measure of the estimate's accuracy.

# Suppose your goal is to estimate the posterior mean of m within a standard error of 0.1 ms
# If the observed naive standard error exceeds this target, no problem!
# You can simply run a longer chain


# SIMULATE the posterior    
sleep_sim_1 <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 1000)

# Summarize the m and s chains of sleep_sim_1
summary(sleep_sim_1)

# RE-SIMULATE the posterior    
sleep_sim_2 <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)

# Summarize the m and s chains of sleep_sim_2
summary(sleep_sim_2)


# COMPILE the model
sleep_jags <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3), inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989)) 

# SIMULATE the posterior    
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)

# Summarize the m and s chains of sleep_sim
summary(sleep_sim)

Chapter 3 - Bayesian Inference and Prediction

Simple Bayesian Regression Model:

  • The simple Bayesian regression lays the groundwork for the more complicated models built on it
  • Suppose that the goal is to model human weights, and that they are N(m, s**2)
    • Y(i) ~ N(m(i), s**2) # m(i) is an average weight that depends on height
    • m(i) = a + b*X(i) where X(i) is the height of individual i
  • Can specify Bayesian model with priors
    • a = intercept ~ N(0, 200**2)
    • b = slope (expected to be positive) ~ N(1, 0.5**2)
    • s = residual standard deviation ~ Unif(0, 20)

Bayesian Regression in RJAGS:

  • The basic lm() regression will give the parameters based on linear regression
  • Can instead define the Bayesian linear regression for RJAGS
    • Within RJAGS, [i] means that it varies for each subject, i
    • Within RJAGS, <- means there is an exact mathematical formula that does not need to be estimated
    • weight_model <- "model{
    • # Likelihood model for Y[i]
    • for(i in 1:length(Y)) {
    •   Y[i] ~ dnorm(m[i], s^(-2))  
    •   m[i] <- a + b * X[i]  
    • }
    • # Prior models for a, b, s
    • a ~ dnorm(0, 200^(-2))
    • b ~ dnorm(1, 0.5^(-2))
    • s ~ dunif(0, 20)
    • }"
    • weight_jags <- jags.model(textConnection(weight_model), data = list(X = bdims$hgt, Y = bdims$wgt), inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 2018))
    • weight_sim <- coda.samples(model = weight_jags, variable.names = c("a", "b", "s"), n.iter = 10000)
  • Options for addressing the Markov chain instability include
    • Standardize the height predictor
    • Increase the chain length

Posterior estimation and inference:

  • Bayesian regression using RJAGS provided estimates for the slope and intercept parameters
  • The posterior densities can be summarized for better communication - for example, point estimates based on means
  • Can also plot the lines corresponding to each (slope, intercept) pair - can further create "credible intervals"
    • The 95% credible intervals are the middle 95% of the densities for each of the parameters (slope and intercept)
  • Can also assess frequencies of exceedence for parameters
    • table(weight_chains$b > 1.1)

Posterior prediction:

  • Based on simulations, the posterior mean trend was estimated
    • Can use the final coefficients to make estimates about the population
    • Could instead model the regression based on each set of coefficients included in the chain, including calculating the credible interval
  • Rather than using the regression to find means, the goal may be to predict an individual
    • Plug in the data as per finding the mean
    • The residual standard deviation (s) can then be used within the chain, simulating a prediction from each set of a, b, s values

Example code includes:

# Note the 3 parameters in the model of weight by height: intercept a, slope b, & standard deviation s
# In the first step of your Bayesian analysis, you will simulate the following prior models for these parameters: a ~ N(0, 200^2), b ~ N(1, 0.5^2), and s ~ Unif(0, 20)

# Take 10000 samples from the a, b, & s priors
prior_a <- rnorm(10000, 0, 200)
prior_b <- rnorm(10000, 1, 0.5)
prior_s <- runif(10000, 0, 20)

# Store samples in a data frame
samples <- data.frame(prior_a, prior_b, prior_s, set=1:10000)

# Construct density plots of the prior samples    
ggplot(samples, aes(x = prior_a)) + 
    geom_density()
ggplot(samples, aes(x = prior_b)) + 
    geom_density()
ggplot(samples, aes(x = prior_s)) + 
    geom_density()


# Replicate the first 12 parameter sets 50 times each
prior_scenarios_rep <- bind_rows(replicate(n = 50, expr = samples[1:12, ], simplify = FALSE)) 

# Simulate 50 height & weight data points for each parameter set
prior_simulation <- prior_scenarios_rep %>% 
    mutate(height = rnorm(600, 170, 10)) %>% 
    mutate(weight = rnorm(600, prior_a + prior_b*height, prior_s))

# Plot the simulated data & regression model for each parameter set
ggplot(prior_simulation, aes(x = height, y = weight)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE, size = 0.75) + 
    facet_wrap(~ set)


# The bdims data set from the openintro package is loaded in your workspace
# bdims contains physical measurements on a sample of 507 individuals, including their weight in kg (wgt) and height in cm (hgt)

# Construct a scatterplot of wgt vs hgt
ggplot(bdims, aes(x = hgt, y = wgt)) + 
    geom_point()

# Add a model smooth
ggplot(bdims, aes(x = hgt, y = wgt)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE)
    
# Obtain the sample regression model
wt_model <- lm(wgt ~ hgt, data = bdims)

# Summarize the model
summary(wt_model)


# DEFINE the model    
weight_model <- "model{
    # Likelihood model for Y[i]
    for(i in 1:length(Y)) {
        Y[i] ~ dnorm(m[i], s^(-2))
        m[i] <- a + b * X[i]
    }

    # Prior models for a, b, s
    a ~ dnorm(0, 200^(-2))
    b ~ dnorm(1, 0.5^(-2))
    s ~ dunif(0, 20)
}"

# COMPILE the model
weight_jags <- jags.model(textConnection(weight_model), data = list(Y = bdims$wgt, X = bdims$hgt), 
                          inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989))

# SIMULATE the posterior    
weight_sim <- coda.samples(model = weight_jags, variable.names = c("a", "b", "s"), n.iter = 1000)

# PLOT the posterior    
plot(weight_sim)


# A 100,000 iteration RJAGS simulation of the posterior, weight_sim_big, is in your workspace along with a data frame of the Markov chain output:
head(weight_chains, 2)

# The posterior means of the intercept & slope parameters, a & b, reflect the posterior mean trend in the relationship between weight & height
# In contrast, the full posteriors of a & b reflect the range of plausible parameters, thus posterior uncertainty in the trend
# You will examine the trend and uncertainty in this trend below
# The bdims data are in your workspace

# Summarize the posterior Markov chains
summary(weight_sim_big)

# Calculate the estimated posterior mean of b
mean(weight_chains$b)

# Plot the posterior mean regression model
ggplot(bdims, aes(x=hgt, y=wgt)) + 
    geom_point() + 
    geom_abline(intercept = mean(weight_chains$a), slope = mean(weight_chains$b), color = "red")

# Visualize the range of 20 posterior regression models
ggplot(bdims, aes(x=hgt, y=wgt)) + 
    geom_point() + 
    geom_abline(intercept = weight_chains$a[1:20], slope = weight_chains$b[1:20], color = "gray", size = 0.25)


# Summarize the posterior Markov chains
summary(weight_sim_big)

# Calculate the 95% posterior credible interval for b
quantile(weight_chains$b, c(0.025, 0.975))

# Calculate the 90% posterior credible interval for b
quantile(weight_chains$b, c(0.05, 0.95))

# Mark the 90% credible interval 
ggplot(weight_chains, aes(x = b)) + 
    geom_density() + 
    geom_vline(xintercept = quantile(weight_chains$b, c(0.05, 0.95)), color = "red")


# Mark 1.1 on a posterior density plot for b
ggplot(weight_chains, aes(x=b)) + 
    geom_density() + 
    geom_vline(xintercept = 1.1, color = "red")

# Summarize the number of b chain values that exceed 1.1
table(weight_chains$b > 1.1)

# Calculate the proportion of b chain values that exceed 1.1 
mean(weight_chains$b > 1.1)


# Calculate the trend under each Markov chain parameter set
weight_chains <- weight_chains %>% 
    mutate(m_180 = a + b*180)

# Construct a posterior density plot of the trend
ggplot(weight_chains, aes(x = m_180)) + 
    geom_density() 

# Calculate the average trend
mean(weight_chains$m_180)

# Construct a posterior credible interval for the trend
quantile(weight_chains$m_180, c(0.025, 0.975))


# Simulate 1 prediction under the first parameter set
rnorm(1, mean=weight_chains$m_180[1], sd=weight_chains$s[1])

# Simulate 1 prediction under the second parameter set
rnorm(1, mean=weight_chains$m_180[2], sd=weight_chains$s[2])

# Simulate & store 1 prediction under each parameter set
weight_chains <- weight_chains  %>% 
    mutate(Y_180=rnorm(nrow(weight_chains), mean=m_180, sd=s))

# Print the first 6 parameter sets & predictions
head(weight_chains)


# Construct a density plot of the posterior predictions
ggplot(weight_chains, aes(x=Y_180)) + 
    geom_density() + 
    geom_vline(xintercept = quantile(weight_chains$Y_180, c(0.025, 0.975)), color = "red")

# Construct a posterior credible interval for the prediction
quantile(weight_chains$Y_180, c(0.025, 0.975))

# Visualize the credible interval on a scatterplot of the data
ggplot(bdims, aes(x=hgt, y=wgt)) + 
    geom_point() + 
    geom_abline(intercept = mean(weight_chains$a), slope = mean(weight_chains$b), color = "red") + 
    geom_segment(x = 180, xend = 180, y = quantile(weight_chains$Y_180, c(0.025)), yend = quantile(weight_chains$Y_180, c(0.975)), color = "red")

Chapter 4 - Multivariate and Generalized Linear Models

Bayesian regression with categorical predictor:

  • Can incorporate categorical predictors into Bayesian regressions
  • Example of usage of a rail-trail in MA
    • Y[i] ~ N(m[i], s**2) where [i] is the day - assumed to have varying means but constant standard deviations
    • X[i] is a 1/0 variable where 1 is for weekdays and 0 is for weekends
    • m[i] = a + b*X[i], meaning that a is the typical weekend volume and a+b is the typical weekday volume
    • The priors will be a ~ N(400, 100**2) and b ~ N(0, 200**2) and s ~ Unif(0, 200)
  • Can then define the model using RJAGS
    • rail_model_1 <- "model{
    • # Likelihood model for Y[i]
    • for(i in 1:length(Y)) {
    •   Y[i] ~ dnorm(m[i], s^(-2))  
    •   m[i] <- a + b[X[i]]  
    • }
    • # Prior models for a, b, s
    • a ~ dnorm(400, 100^(-2))
    • s ~ dunif(0, 200)
    • b[1] <- 0
    • b[2] ~ dnorm(0, 200^(-2))
    • }"
    • Note that b[1] <- 0 is because we want m[i] = a for the reference level; b[2], the second level, is what we want to model

Multivariate Bayesian regression:

  • Bayesian models can be generalized to multivariate models, for example
    • Y[i] ~ N(m[i], s**2) where [i] is the day - assumed to have varying means but constant standard deviations
    • X[i] is a 1/0 variable where 1 is for weekdays and 0 is for weekends
    • Z[i] is the high temperature on day [i] in degrees F
    • m[i] = a + b*X[i] + c*Z[i]
    • a ~ N(0, 200**2), b ~ N(0, 200**2), c ~ N(0, 20**2), s ~ Unif(0, 200)
  • Can then define and simulate this model using RJAGS
    • rail_model_2 <- "model{
    • # Likelihood model for Y[i]
    • for(i in 1:length(Y)) {
    •   Y[i] ~ dnorm(m[i], s^(-2))  
    •   m[i] <- a + b[X[i]] + c * Z[i]  
    • }
    • # Prior models for a, b, c, s
    • a ~ dnorm(0, 200^(-2))
    • b[1] <- 0
    • b[2] ~ dnorm(0, 200^(-2))
    • c ~ dnorm(0, 20^(-2))
    • s ~ dunif(0, 200)
    • }"

Bayesian Poisson regression:

  • Can generalize regression techniques to non-normalized settings, such as Poisson regressions
  • Bicycle riders per day might better be modeled as a Poisson - should be discrete and non-negative, for example
    • Y ~ Pois(lambda[i])
    • log(lambda[i]) = a + b * X[i] + c * Z[i]
    • a ~ N(0, 200**2)
    • b ~ N(0, 2**2)
    • c ~ N(0, 2**2)
  • Can then define the model within RJAGS
    • poisson_model <- "model{
    • # Likelihood model for Y[i]
    • for(i in 1:length(Y)) {
    •   Y[i] ~ dpois(l[i])  
    •   log(l[i]) <- a + b[X[i]] + c*Z[i]  
    • }
    • # Prior models for a, b, c
    • a ~ dnorm(0, 200^(-2))
    • b[1] <- 0
    • b[2] ~ dnorm(0, 2^(-2))
    • c ~ dnorm(0, 2^(-2))
    • }"
  • A caveat for the Poisson is that the mean and variance should be roughly equal; in practice some overdispersion (variance exceeding the mean) may be tolerated
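
The mean-variance caveat can be checked numerically. A quick simulated illustration contrasting equidispersed Poisson counts with overdispersed negative binomial counts (with the real data, the same check would compare mean(RailTrail$volume) against var(RailTrail$volume)):

```r
# The Poisson assumes mean and variance are roughly equal
set.seed(10)
equi <- rpois(5000, lambda = 400)            # Poisson: variance equals the mean
over <- rnbinom(5000, mu = 400, size = 5)    # negative binomial: overdispersed

c(mean = mean(equi), var = var(equi))   # variance close to 400
c(mean = mean(over), var = var(over))   # variance far exceeds 400
```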

Wrap up:

  • Bayesian modeling has grown in popularity along with computing resources
  • RJAGS allows for defining, compiling, and simulating Bayesian models
  • Intuitive posterior inference, including credible intervals
  • Generalizing from normal models to Poisson models

Example code includes:

# Confirm that weekday is a factor variable
is.factor(RailTrail$weekday)

# Construct a density plot of volume by weekday
ggplot(RailTrail, aes(x = volume, fill = weekday)) + 
    geom_density(alpha = 0.5)

# Calculate the mean volume on weekdays vs weekends
RailTrail %>%
  group_by(weekday) %>%
  summarize(mean(volume))


# DEFINE the model    
rail_model_1 <- "model{
    # Likelihood model for Y[i]
    for(i in 1:length(Y)) {
      Y[i] ~ dnorm(m[i], s^(-2))
      m[i] <- a + b[X[i]]
    }
    
    # Prior models for a, b, s
    a ~ dnorm(400, 100^(-2))
    b[1] <- 0
    b[2] ~ dnorm(0, 200^(-2))
    s ~ dunif(0, 200)
}"

# COMPILE the model
rail_jags_1 <- jags.model(textConnection(rail_model_1), 
    data = list(Y=RailTrail$volume, X=RailTrail$weekday),
    inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10)
    )  


# SIMULATE the posterior    
rail_sim_1 <- coda.samples(model = rail_jags_1, variable.names = c("a", "b", "s"), n.iter = 10000)

# Store the chains in a data frame
rail_chains_1 <- data.frame(rail_sim_1[[1]])

# PLOT the posterior    
plot(rail_sim_1)


# Posterior probability that typical volume is lower on weekdays
mean(rail_chains_1$'b.2.' < 0)

# Construct a chain of values for the typical weekday volume
rail_chains_1 <- rail_chains_1 %>% 
    mutate(weekday_mean = a + b.2.)

# Construct a density plot of the weekday chain
ggplot(rail_chains_1, aes(x=weekday_mean)) +
  geom_density()

# 95% credible interval for typical weekday volume
quantile(rail_chains_1$weekday_mean, c(0.025, 0.975))


# Construct a plot of volume by hightemp & weekday
ggplot(RailTrail, aes(x=hightemp, y=volume, color=weekday)) + 
    geom_point()

# Construct a sample model
rail_lm <- lm(volume ~ weekday + hightemp, data=RailTrail)

# Summarize the model
summary(rail_lm)

# Superimpose sample estimates of the model lines
ggplot(RailTrail, aes(x=hightemp, y=volume, color=weekday)) + 
    geom_point() + 
    geom_abline(intercept = coef(rail_lm)["(Intercept)"], slope = coef(rail_lm)["hightemp"], color = "red") +
    geom_abline(intercept = sum(coef(rail_lm)[c("(Intercept)", "weekdayTRUE")]), slope = coef(rail_lm)["hightemp"], color = "turquoise3")


# DEFINE the model    
rail_model_2 <- "model{
  # Likelihood model for Y[i]
  for(i in 1:length(Y)){
    Y[i] ~ dnorm(m[i], s^(-2))
    m[i] <- a + b[X[i]] + c * Z[i]
  }
    
  # Prior models for a, b, c, s
  a ~ dnorm(0, 200^(-2))
  b[1] <- 0
  b[2] ~ dnorm(0, 200^(-2))
  c ~ dnorm(0, 20^(-2))
  s ~ dunif(0, 200)
}"

# COMPILE the model
rail_jags_2 <- jags.model(textConnection(rail_model_2), 
                          data = list(Y=RailTrail$volume, X=RailTrail$weekday, Z=RailTrail$hightemp),
                          inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10)
                          )

# SIMULATE the posterior    
rail_sim_2 <- coda.samples(model = rail_jags_2, variable.names = c("a", "b", "c", "s"), n.iter = 10000)

# Store the chains in a data frame
rail_chains_2 <- data.frame(rail_sim_2[[1]])

# PLOT the posterior    
plot(rail_sim_2)


# Summarize the posterior Markov chains
summary(rail_sim_2)

# Plot the posterior mean regression models
ggplot(RailTrail, aes(x=hightemp, y=volume, color=weekday)) + 
    geom_point() + 
    geom_abline(intercept = mean(rail_chains_2[, "a"]), slope = mean(rail_chains_2[, "c"]), color = "red") + 
    geom_abline(intercept = mean(rail_chains_2[, "a"]) + mean(rail_chains_2[, "b.2."]), slope = mean(rail_chains_2[, "c"]), color = "turquoise3")
  
# Posterior probability that typical volume is lower on weekdays
mean(rail_chains_2$'b.2.' < 0)


# DEFINE the model    
poisson_model <- "model{
    # Likelihood model for Y[i]
    for(i in 1:length(Y)) {
        Y[i] ~ dpois(l[i])
        log(l[i]) <- a + b[X[i]] + c * Z[i]
    }

    # Prior models for a, b, c
    a ~ dnorm(0, 200^(-2))
    b[1] <- 0
    b[2] ~ dnorm(0, 2^(-2))
    c ~ dnorm(0, 2^(-2))
}" 

# COMPILE the model
poisson_jags <- jags.model(textConnection(poisson_model), 
                           data = list(Y=RailTrail$volume, X=RailTrail$weekday, Z=RailTrail$hightemp),
                           inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10)
                           )

# SIMULATE the posterior    
poisson_sim <- coda.samples(model = poisson_jags, variable.names = c("a", "b", "c"), n.iter = 10000)

# Store the chains in a data frame
poisson_chains <- data.frame(poisson_sim[[1]])

# PLOT the posterior    
plot(poisson_sim)


# Summarize the posterior Markov chains
summary(poisson_sim)

# Plot the posterior mean regression models
ggplot(RailTrail, aes(x = hightemp, y = volume, color = weekday)) + 
    geom_point() + 
    stat_function(fun = function(x){exp(5.01352 + 0.01426 * x)}, color = "red") + 
    stat_function(fun = function(x){exp(5.01352 - 0.12800 + 0.01426 * x)}, color = "turquoise3")


# Calculate the typical volume on 80 degree weekends & 80 degree weekdays
poisson_chains <- poisson_chains %>% 
    mutate(l_weekend=exp(a + c*80)) %>% 
    mutate(l_weekday=exp(a + b.2. + c*80))

# Construct a 95% CI for typical volume on 80 degree weekend
quantile(poisson_chains$l_weekend, c(0.025, 0.975))

# Construct a 95% CI for typical volume on 80 degree weekday
quantile(poisson_chains$l_weekday, c(0.025, 0.975))


# Simulate weekend & weekday predictions under each parameter set
poisson_chains <- poisson_chains %>% 
    mutate(Y_weekend=rpois(nrow(poisson_chains), l_weekend)) %>% 
    mutate(Y_weekday=rpois(nrow(poisson_chains), l_weekday))
    
# Print the first 6 sets of parameter values & predictions
head(poisson_chains)

# Construct a density plot of the posterior weekday predictions
ggplot(poisson_chains, aes(x=Y_weekday)) +
  geom_density()

# Posterior probability that weekday volume is less than 400
mean(poisson_chains$Y_weekday < 400)

Parallel Programming in R

Chapter 1 - Can I run my application in parallel?

Partitioning problems into independent pieces:

  • Course contents include
    • Methods of parallel programming and R packages for support
    • The parallel package in R
    • Packages foreach and future.apply
    • Random numbers and reproducibility
  • Programs can be partitioned either by tasks (e.g., birth model, death model, migration model) or by data (chunks of data passed to a routine, such as rowSums applied to a matrix)
    • Many independent tasks are referred to as "embarrassingly parallel", and this is common in statistical simulations

Models of parallel computing:

  • Available hardware drives the ability to split components - the number of CPUs and the memory model (shared memory or distributed memory)
    • Message passing software runs on distributed memory and allows for fully independent processes
    • Shared memory allows for easier passing of data
  • Programming paradigms include master-worker and map-reduce (Hadoop or Scala or the like)
  • This course will cover the master-worker model
    • The master process creates processes for the workers and then compiles the results that the workers return

R packages for parallel computing:

  • The R core package parallel ships with base R, so parallel code can stay independent of other packages
  • Can instead work with iotools and sparklyr for working with the map-reduce process
    • Further, pbdR allows for many parallel approaches within R
  • The master-worker paradigms can be implemented using foreach, future.apply, snow, snowFT, snowfall, future (currently under active development, more modern)
  • The parallel package can be used for basic parallel tasks
    • ncores <- parallel::detectCores(logical = FALSE)
    • cl <- parallel::makeCluster(ncores)
    • parallel::clusterApply(cl, x = ncores:1, fun = rnorm) # x is passed to the workers in order, so worker 1 will get rnorm(ncores)
    • parallel::stopCluster(cl)
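
The same master-worker pattern works with parLapply(), which splits a list across the cluster workers and collects the results. A minimal sketch using only the parallel package (the 8 mean-of-normals tasks are an illustrative stand-in for any embarrassingly parallel workload):

```r
library(parallel)

# Master-worker sketch: the master creates the workers, hands each a task,
# and collects the returned values
ncores <- detectCores(logical = FALSE)
cl <- makeCluster(ncores)

# Each of the 8 tasks computes the mean of 1e5 standard normal draws
result <- parLapply(cl, rep(1e5, 8), function(n) mean(rnorm(n)))

stopCluster(cl)  # always release the workers when done
unlist(result)   # each value should be close to 0
```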

Example code includes:

extract_words <- function(book_name) {
    # extract the text of the book
    text <- subset(austen_books(), book == book_name)$text
    # extract words from the text and convert to lowercase
    str_extract_all(text, boundary("word")) %>% unlist %>% tolower
}

janeausten_words <- function() {
    # Names of the six books contained in janeaustenr
    books <- austen_books()$book %>% unique %>% as.character
    # Vector of words from all six books
    words <- sapply(books, extract_words) %>% unlist
    words
}

austen_books <- function () 
{
    books <- list('Sense & Sensibility' = janeaustenr::sensesensibility, 
                  'Pride & Prejudice' = janeaustenr::prideprejudice, 
                  'Mansfield Park' = janeaustenr::mansfieldpark, 
                  'Emma' = janeaustenr::emma, 
                  'Northanger Abbey' = janeaustenr::northangerabbey, 
                  'Persuasion' = janeaustenr::persuasion
                  )
    ret <- data.frame(text = unlist(books, use.names = FALSE), stringsAsFactors = FALSE)
    ret$book <- factor(rep(names(books), sapply(books, length)))
    ret$book <- factor(ret$book, levels = unique(ret$book))
    structure(ret, class = c("tbl_df", "tbl", "data.frame"))
}

max_frequency <- function(letter, words, min_length = 1) {
    w <- select_words(letter, words = words, min_length = min_length)
    frequency <- table(w)    
    frequency[which.max(frequency)]
}

select_words <- function(letter, words, min_length = 1) {
    min_length_words <- words[nchar(words) >= min_length]
    grep(paste0("^", letter), min_length_words, value = TRUE)
}

# Vector of words from all six books
words <- janeausten_words()

# Most frequent "a"-word that is at least 5 chars long
max_frequency(letter = "a", words = words, min_length = 5)
## again 
##  1001
# Partitioning
result <- lapply(letters, FUN=max_frequency,
                 words = words, min_length = 5) %>% unlist()

# barplot of result
barplot(result, las = 2)

replicates <- 50
sample_size <- 10000

# Function that computes mean of normal random numbers
myfunc <- function(n, ...) mean(rnorm(n, ...))

# Init result, set seed & repeat the task sequentially
result <- rep(NA, replicates)
set.seed(123)
for(iter in 1:replicates) result[iter] <- myfunc(sample_size)

# View result
hist(result)

# Use sapply() with different distribution parameters
hist(sapply(rep(sample_size, replicates), FUN=myfunc, mean = 10, sd = 5))

# We'll now introduce a demographic model to be used throughout the course. It projects net migration rates via an AR(1) model, rate(t+1) - mu = phi * (rate(t) - mu) + error, where the error has variance sigma^2
# An MCMC estimation for the USA resulted in 1000 samples of the parameters mu, phi and sigma
# The task is to project the future distribution of migration rates

ar1_trajectory <- function(est, rate0, len = 15) {
    trajectory <- rep(NA, len)
    rate <- rate0
    for (time in seq_len(len)) {
        trajectory[time] <- ar1(est, r = rate)
        rate <- trajectory[time]
    }
    trajectory
}

ar1 <- function(est, r) {
    est['mu'] + est['phi'] * (r - est['mu']) + 
        rnorm(1, sd = est['sigma'])
}

ar1_block <- function(id, rate0 = 0.015, traj_len = 15, block_size = 10) {
    trajectories <- matrix(NA, nrow = block_size, ncol = traj_len)
    for (i in seq_len(block_size)) 
        trajectories[i,] <- ar1_trajectory(unlist(ar1est[id, ]), rate0 = rate0, len = traj_len)
    trajectories
}

show_migration <- function(trajs) {
    df <- data.frame(time = seq(2020, by = 5, len = ncol(trajs)),
                     migration_rate = apply(trajs, 2, median),
                     lower = apply(trajs, 2, quantile, 0.1),
                     upper = apply(trajs, 2, quantile, 0.9)
                    )
    g <- ggplot(df, aes(x = time, y = migration_rate)) + 
        geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey70") + 
        geom_line()
    print(g)
}


# Simulate from multiple rows of the estimation dataset
ar1_multblocks <- function(ids, ...) {
    trajectories <- NULL
    for (i in seq_along(ids)) {
        trajectories <- rbind(trajectories, ar1_block(ids[i], ...))
    }
    trajectories
}

ar1est <- data.frame(mu=c(0.0105, 0.0185, 0.022, 0.0113, 0.0144, 0.0175, -9e-04, 0.0093, 0.0111, -9e-04, -0.0024, 0.0086, 0.012, 0.0161, 0.0043, 0.0175, 0.0118, 0.0019, 0.0116, 0.0048, 0.0154, 0.0137, 0.0168, 0.0191, 0.0108, -0.0037, 0.0135, 0.0203, -0.0042, 0.0097, 0.0209, 0.0034, 0.0113, 0.0102, 0.0094, -0.0012, 0.008, 0.0082, 0.0123, 0.0175, 0.0054, -0.0087, 0.0161, 0.0155, 0.0126, 0.0181, 0.014, -0.0135, -0.0095, 0.0142, 0.011, 0.0194, 0.0149, 0.0115, 0.0129, -0.0124, 0.0116, 0.0136, 0.0161, 0.005, 0.0165, -0.0079, 0.0129, -0.0016, -7e-04, 0.0243, 0.0193, -0.004, 0.0145, 0.0078, 0.0156, 0.001, 0.0032, 0.0069, 0.0146, 0.0164, 0.0113, 0.0116, 0.0182, 0.0167, -0.0031, 0.0168, 0.0137, 0.012, -0.0212, -0.0092, 0.019, 0.0167, -0.0021, 0.0156, 0.0173, 0.0148, -0.0036, 0.0168, 0.0179, 0.0086, 0.0131, 0.015, 0.0106, 0.0132, 0.0119, 0.0156, 0.0159, 0.0256, 0.0071, 0.0163, 0.0107, 0.0139, 0.0228, 0.0139, 0.0117, 0.0133, 0.0127, -0.0162, 0.0115, 0.0095, 0.0183, 0.0183, -6e-04, 0.0177, 0.0145, 0.0041, 0.0143, 0.0135, -0.0078, 0.0036, 0.015, 0.018, 0.0158, 0.0054, -0.0204, 0.0193, 0.0051, 0.0144, 0.0129, 0.0134, 0.0116, 0.0102, 0.0203, 0.0154, 0.0106, 0.0184, 0.0096, -0.0032, 0.0143, 0.0158, 0.0093, 0.0159, 0.0112, 0.0106, 0.0075, 0.0133, 0.0171, 0.0133, 0.0139, 0.0167, 0.0131, -0.0078, 0.0135, 0.0145, 0.0104, 8e-04, 0.0205, 0.0046, 0.011, 0.0148, 0.0202, 8e-04, 0.0211, 0.0135, -8e-04, -0.0104, -0.0027, 0.0094, 0.0179, -0.0101, 0.0156, 0.0155, 0.014, 0.0149, 0.0165, 0.0168, 0.0155, 0.0136, 0.0156, 0.0149, 0.0191, 0.0176, 0.0094, -0.0076, 0.0162, 0.0143, 0.0182, 0.0102, 0.015, -0.0292, 0.0063, -0.0028, 0.0163, 0.015), 
                     sigma=c(0.0081, 0.0053, 0.0069, 0.0075, 0.0082, 0.006, 0.0101, 0.011, 0.0064, 0.0066, 0.0095, 0.0057, 0.0078, 0.005, 0.0076, 0.0064, 0.0067, 0.0049, 0.0086, 0.0067, 0.0063, 0.0054, 0.0063, 0.0077, 0.0072, 0.0074, 0.0067, 0.0047, 0.0125, 0.0069, 0.0052, 0.0073, 0.0063, 0.0072, 0.0086, 0.0079, 0.009, 0.006, 0.0077, 0.0061, 0.0082, 0.0072, 0.0054, 0.0056, 0.0072, 0.0085, 0.0064, 0.0058, 0.0064, 0.0084, 0.0075, 0.006, 0.0048, 0.0068, 0.0065, 0.0082, 0.0072, 0.0056, 0.0056, 0.0055, 0.0054, 0.0059, 0.0064, 0.0069, 0.0073, 0.0071, 0.0057, 0.0062, 0.0086, 0.0062, 0.0054, 0.0052, 0.0066, 0.0076, 0.0046, 0.0056, 0.0066, 0.0077, 0.0074, 0.0061, 0.0056, 0.0065, 0.0069, 0.0084, 0.0058, 0.007, 0.0074, 0.0077, 0.0081, 0.0083, 0.0054, 0.0057, 0.0076, 0.0119, 0.0056, 0.0078, 0.005, 0.0073, 0.0075, 0.0054, 0.0085, 0.011, 0.0063, 0.0056, 0.009, 0.0069, 0.008, 0.0063, 0.007, 0.0059, 0.0064, 0.006, 0.0103, 0.0085, 0.006, 0.0076, 0.0054, 0.0066, 0.0056, 0.0071, 0.0079, 0.007, 0.0085, 0.0075, 0.007, 0.0085, 0.006, 0.0067, 0.006, 0.0074, 0.0098, 0.0066, 0.0058, 0.0075, 0.0064, 0.0059, 0.0103, 0.0055, 0.0053, 0.0068, 0.0057, 0.009, 0.0118, 0.0096, 0.0085, 0.0075, 0.0078, 0.0041, 0.0056, 0.008, 0.0071, 0.006, 0.0046, 0.0061, 0.007, 0.0061, 0.0066, 0.0075, 0.0094, 0.0072, 0.008, 0.0064, 0.0079, 0.0068, 0.0069, 0.0058, 0.0056, 0.0057, 0.0065, 0.006, 0.0073, 0.0067, 0.0068, 0.0071, 0.0048, 0.0071, 0.0063, 0.0051, 0.0079, 0.0042, 0.0048, 0.0066, 0.0072, 0.0058, 0.0057, 0.0083, 0.0063, 0.0057, 0.0103, 0.0096, 0.0067, 0.0051, 0.0075, 0.0064, 0.0069, 0.007, 0.007, 0.0074, 0.0056, 0.006), 
                     phi=c(0.42, 0.3509, 0.8197, 0.5304, 0.1491, 0.3675, 0.9687, 0.7877, 0.7114, 0.9435, 0.9634, 0.9189, 0.4758, 0.5738, 0.8016, 0.0509, 0.8281, 0.8168, 0.7442, 0.9347, 0.1699, 0.3566, 0.8388, 0.7724, 0.7474, 0.7834, 0.6661, 0.5162, 0.9025, 0.5306, 0.6912, 0.7625, 0.8289, 0.6985, 0.9188, 0.9639, 0.3178, 0.7288, 0.4129, 0.2196, 0.9304, 0.9697, 0.193, 0.1474, 0.3111, 0.8844, 0.7386, 0.9674, 0.9983, 0.4863, 0.9338, 0.7999, 0.4696, 0.5078, 0.5141, 0.9958, 0.6404, 0.2886, 0.4171, 0.9856, 0.3261, 0.9713, 0.682, 0.7686, 0.8577, 0.9481, 0.6057, 0.934, 0.3161, 0.9414, 0.8349, 0.8325, 0.8913, 0.7726, 0.7327, 0.1403, 0.8144, 0.7506, 0.225, 0.4884, 0.9052, 0.2891, 0.1652, 0.7612, 0.9403, 0.9865, 0.4107, 0.6518, 0.893, 0.4981, 0.72, 0.3366, 0.8437, 0.2551, 0.7753, 0.5, 0.7857, 0.7107, 0.5643, 0.2887, 0.9621, 0.2384, 0.414, 0.86, 0.6917, 0.4946, 0.2325, 0.3419, 0.9219, 0.2706, 0.717, 0.2327, 0.7541, 0.9692, 0.5838, 0.9346, 0.4739, 0.3219, 0.9634, 0.3046, 0.9913, 0.8485, 0.3071, 0.0373, 0.9183, 0.7935, 0.0039, 0.5968, 0.3654, 0.595, 0.9712, 0.2745, 0.6027, 0.7441, 0.7641, 0.3582, 0.3397, 0.7748, 0.8188, 0.0604, 0.5076, 0.2856, 0.6859, 0.6705, 0.0326, 0.8749, 0.2596, 0.1138, 0.6072, 0.4, 0.9241, 0.612, 0.2375, 0.2495, 0.0661, 0.3234, 0.7651, 0.8581, 0.4818, 0.7303, 0.7458, 0.8925, 0.2861, 0.982, 0.0791, 0.2474, 0.4326, 0.8757, 0.5288, 0.6476, 0.8473, 0.9098, 0.9562, 0.8464, 0.5444, 0.9738, 0.706, 0.0795, 0.391, 0.3167, 0.3311, 0.5681, 0.27, 0.9046, 0.2299, 0.2299, 0.085, 0.4002, 0.7443, 0.9865, 0.7028, 0.9016, 0.6092, 0.2367, 0.5402, 0.9401, 0.8013, 0.993, 0.2473, 0.6414)
                     )
str(ar1est)
## 'data.frame':    200 obs. of  3 variables:
##  $ mu   : num  0.0105 0.0185 0.022 0.0113 0.0144 0.0175 -0.0009 0.0093 0.0111 -0.0009 ...
##  $ sigma: num  0.0081 0.0053 0.0069 0.0075 0.0082 0.006 0.0101 0.011 0.0064 0.0066 ...
##  $ phi  : num  0.42 0.351 0.82 0.53 0.149 ...
# Generate trajectories for all rows of the estimation dataset
trajs <- ar1_multblocks(seq_len(nrow(ar1est)), rate0 = 0.015,  block_size = 10, traj_len = 15)

# Show results
show_migration(trajs)

# Load package
library(parallel)

# How many physical cores are available?
ncores <- detectCores(logical = FALSE)

# Create a cluster
cl <- makeCluster(ncores)

# Process rnorm in parallel
clusterApply(cl, 1:ncores, fun = rnorm, mean = 10, sd = 2)
## [[1]]
## [1] 10.87295
## 
## [[2]]
## [1] 11.271765  9.757961
# Evaluate partial sums in parallel
part_sums <- clusterApply(cl, x = c(1, 51), fun = function(x) sum(x:(x + 49)))

# Total sum
total <- sum(unlist(part_sums))

# Check for correctness
total == sum(1:100)
## [1] TRUE
# Stop the cluster
stopCluster(cl)


# Create a cluster and set parameters
cl <- makeCluster(2)
replicates <- 50
sample_size <- 10000

# Parallel evaluation
means <- clusterApply(cl, x = rep(sample_size, replicates), fun = myfunc)
                
# View results as histogram
hist(unlist(means))


Chapter 2 - The parallel package

Cluster basics:

  • The parallel package consists of two parts - snow (Tierney) and multicore (Urbanek)
    • snow works on any operating system
    • multicore works on Unix-like systems but not on Windows
  • Supported backends for snow are managed automatically by the parallel package
    • cl <- makeCluster(ncores, type = "PSOCK") # default socket communication; works on all OSes; all workers start with a completely empty environment
    • cl <- makeCluster(ncores, type = "FORK") # all OSes except Windows; all workers are complete copies of the master environment
    • cl <- makeCluster(ncores, type = "MPI") # interface provided by Rmpi; may be more efficient on machines where MPI is enabled

Core of parallel:

  • The main processing functions are clusterApply and clusterApplyLB ("load balanced")
  • The wrapper functions include parApply, parLapply, parSapply, parRapply (rows of a matrix), parCapply (columns of a matrix)
    • Further, parLapplyLB and parSapplyLB are load-balanced wrappers around clusterApplyLB
  • Example of using clusterApply for work on a pre-defined cluster cl
    • clusterApply(cl, x = arg.sequence, fun = myfunc) # each element of x is passed to myfunc (length of x is the total number of tasks)
  • There are several overheads involved in parallel processing - starting/stopping clusters, the number of messages sent between master and workers, and the size of those messages
    • Communication between master and workers is expensive, so longer-running worker tasks are preferred in general
    • The overheads may sometimes be so significant as to make parallel processing slower than sequential processing
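The wrapper relationship can be verified directly; a minimal sketch (the cluster size of 2 and the toy squaring task are arbitrary illustrations, not from the course):

```r
library(parallel)

cl <- makeCluster(2)

# clusterApply returns a list with one element per task
res_list <- clusterApply(cl, x = 1:4, fun = function(x) x^2)

# parSapply wraps the same machinery and simplifies the result to a vector
res_vec <- parSapply(cl, 1:4, function(x) x^2)

identical(unlist(res_list), res_vec)  # TRUE - both give 1 4 9 16

stopCluster(cl)
```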

Initialization of nodes:

  • Cluster nodes typically start with a clean, empty environment (default for sockets)
  • Repeated communications with the workers is expensive
    • clusterApply(cl, rep(1000, n), rnorm, sd = 1:1000) # master must send the sd vector to the workers with every task (big overhead)
  • Good practice is to initialize workers at the beginning with everything that stays constant and/or is time consuming
    • Sending static data or datasets, loading libraries, evaluating global functions, etc.
  • The clusterCall() will call the same function with the same arguments on all the nodes
    • cl <- makeCluster(2)
    • clusterCall(cl, function() library(janeaustenr)) # loads the package on all the workers
    • clusterCall(cl, function(i) emma[i], 20) # returns the 20th element of emma from janeaustenr
  • The clusterEvalQ() will evaluate a literal expression on all nodes
    • cl <- makeCluster(2)
    • clusterEvalQ(cl, { library(janeaustenr) ; library(stringr) ; get_books <- function() austen_books()$book %>% unique %>% as.character }) # all books in the package
    • clusterCall(cl, function(i) get_books()[i], 1:3) # function get_books is available in the environment due to the above
  • The clusterExport() will export objects from master to the workers
    • books <- get_books()
    • cl <- makeCluster(2)
    • clusterExport(cl, "books") # the object name is passed as a quoted string
    • clusterCall(cl, function() print(books)) # books will be available since it was passed by clusterExport()

Subsetting data:

  • Each task is applied to a different data chunk; these can be made available to the worker in various ways
    • Random numbers on the fly
    • Arguments
    • Chunking on the worker side
  • Example of random numbers being created on the fly by the workers (reproducibility covered in later chapters)
    • myfunc <- function(n, ...) mean(rnorm(n, ...))
    • clusterApply(cl, rep(1000, 20), myfunc, sd = 6)
  • Example of chunking the data on the master side and then passing the data to workers as an argument
    • Incorporated into parApply() by default
    • cl <- makeCluster(4)
    • mat <- matrix(rnorm(12), ncol=4)
    • parCapply(cl, mat, sum) # splits the matrix by column and passes to worker
    • unlist(clusterApply(cl, as.data.frame(mat), sum)) # needs to be converted to data.frame first for clusterApply()
  • Example of chunking data on the worker side (each worker is pre-populated with the full data) - saves communication time
    • n <- 100
    • M <- matrix(rnorm(n * n), ncol = n)
    • clusterExport(cl, "M")
    • mult_row <- function(id) apply(M, 2, function(col) sum(M[id,] * col))
    • clusterApply(cl, 1:n, mult_row) %>% do.call(rbind, .)
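In the worker-side chunking example, mult_row(id) computes row id of the matrix product M %*% M, so the combined result can be sanity-checked against base R's %*%; a minimal sketch (the small n and the 2-worker cluster are arbitrary choices):

```r
library(parallel)

cl <- makeCluster(2)
n <- 20
M <- matrix(rnorm(n * n), ncol = n)

# Pre-populate every worker with the full matrix once (saves per-task communication)
clusterExport(cl, "M")

# Each task computes one row of M %*% M; M is found in the worker's global environment
mult_row <- function(id) apply(M, 2, function(col) sum(M[id, ] * col))
res <- do.call(rbind, clusterApply(cl, 1:n, mult_row))

all.equal(res, M %*% M, check.attributes = FALSE)  # TRUE

stopCluster(cl)
```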

Example code includes:

# Load parallel and create a cluster
library(parallel)
cl <- makeCluster(4)

# Investigate the cl object and its elements
typeof(cl)
## [1] "list"
length(cl)
## [1] 4
typeof(cl[[3]])
## [1] "list"
cl[[3]]$rank
## [1] 3
# What is the process ID of the workers
clusterCall(cl, Sys.getpid)
## [[1]]
## [1] 9912
## 
## [[2]]
## [1] 25568
## 
## [[3]]
## [1] 28984
## 
## [[4]]
## [1] 27584
# Stop the cluster
stopCluster(cl)


# Define ncores and a print function
ncores <- 2
print_ncores <- function() print(ncores)

# Create a socket and a fork clusters
# cl_sock <- makeCluster(ncores, type = "PSOCK")
# cl_fork <- makeCluster(ncores, type = "FORK")  # this is possible only on OS other than Windows

# Evaluate the print function on each cluster
# clusterCall(cl_sock, print_ncores)  # this will fail since the socket has no knowledge of the main environment
# clusterCall(cl_fork, print_ncores)

# Change ncores and evaluate again
# ncores <- 4
# clusterCall(cl_fork, print_ncores)  # the fork is only of the original environment, so these clusters will still think the answer is 2


# In this exercise, you will take the simple embarrassingly parallel application for computing mean of random numbers (myfunc()) from the first chapter, and implement two functions:
# One that runs the application sequentially, mean_seq(), and one that runs it in parallel, mean_par()
# Both functions have three arguments, n (sample size), repl (number of replicates) and ... (passed to myfunc())
# Function mean_par() assumes a cluster object cl to be present in the global environment

# Function to run repeatedly
myfunc <- function(n, ...) mean(rnorm(n, ...))

# Sequential solution
mean_seq <- function(n, repl, ...) {
    res <- rep(NA, repl)
    for (it in 1:repl) res[it] <- myfunc(n, ...)
    res
}

# Parallel solution
mean_par <- function(n, repl, ...) {
    res <- clusterApply(cl, x = rep(n, repl), fun = myfunc, ...)
    unlist(res)
}


# Load packages 
library(parallel)
library(microbenchmark)

# Create a cluster
cl <- makeCluster(2)

# Compare run times
microbenchmark(mean_seq(3000000, repl = 4), 
               mean_par(3000000, repl = 4),
               mean_seq(100, repl = 100), 
               mean_par(100, repl = 100),
               times = 1, unit = "s")
## Unit: seconds
##                       expr         min          lq        mean      median
##  mean_seq(3e+06, repl = 4) 2.781468964 2.781468964 2.781468964 2.781468964
##  mean_par(3e+06, repl = 4) 0.847188307 0.847188307 0.847188307 0.847188307
##  mean_seq(100, repl = 100) 0.003291041 0.003291041 0.003291041 0.003291041
##  mean_par(100, repl = 100) 0.180692568 0.180692568 0.180692568 0.180692568
##           uq         max neval
##  2.781468964 2.781468964     1
##  0.847188307 0.847188307     1
##  0.003291041 0.003291041     1
##  0.180692568 0.180692568     1
# Stop cluster               
stopCluster(cl)


# Load extraDistr on master
library(extraDistr)
## 
## Attaching package: 'extraDistr'
## The following object is masked from 'package:purrr':
## 
##     rdunif
# Define myrdnorm 
myrdnorm <- function(n, mean = 0, sd = 1) 
    rdnorm(n, mean = mean, sd = sd)

# Run myrdnorm in parallel - should fail
# res <- clusterApply(cl, rep(1000, 20), myrdnorm, sd = 6)   # will error out


# Load extraDistr on all workers
cl <- makeCluster(2)
clusterEvalQ(cl, library(extraDistr))
## [[1]]
## [1] "extraDistr" "stats"      "graphics"   "grDevices"  "utils"     
## [6] "datasets"   "methods"    "base"      
## 
## [[2]]
## [1] "extraDistr" "stats"      "graphics"   "grDevices"  "utils"     
## [6] "datasets"   "methods"    "base"
# Run myrdnorm in parallel again and show results
res <- clusterApply(cl, rep(1000, 20), myrdnorm, sd = 6)
hist(unlist(res))

# myrdnorm that uses global variables
myrdnorm <- function(n) rdnorm(n, mean = mean, sd = sd)

# Initialize workers 
clusterEvalQ(cl, {
    library(extraDistr)
    mean=10
    sd=5
    })
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 5
# Run myrdnorm in parallel and show results
res <- clusterApply(cl, rep(1000, 100), myrdnorm)

# View results
hist(unlist(res))

# Set global objects on master
mean <- 20
sd <- 10

# Export global objects to workers
clusterExport(cl, c("mean", "sd"))

# Load extraDistr on workers
clusterEvalQ(cl, library(extraDistr))
## [[1]]
## [1] "extraDistr" "stats"      "graphics"   "grDevices"  "utils"     
## [6] "datasets"   "methods"    "base"      
## 
## [[2]]
## [1] "extraDistr" "stats"      "graphics"   "grDevices"  "utils"     
## [6] "datasets"   "methods"    "base"
# Run myrdnorm in parallel and show results
res <- clusterApply(cl, rep(1000, 100), myrdnorm)
hist(unlist(res))

select_words <- function(letter, words, min_length = 1) {
    min_length_words <- words[nchar(words) >= min_length]
    grep(paste0("^", letter), min_length_words, value = TRUE)
}

# Export "select_words" to workers
clusterExport(cl, "select_words")

# Split indices for two chunks
ind <- splitIndices(length(words), 2)

# Find unique words in parallel
result <- clusterApply(cl, x = list(words[ind[[1]]], words[ind[[2]]]),  
            function(w, ...) unique(select_words("v", w, ...)), 
            min_length = 10)
            
# Show vectorized unique results
unique(unlist(result))
##  [1] "voluntarily"    "variations"     "vindication"    "violoncello"   
##  [5] "vouchsafed"     "veneration"     "volatility"     "volubility"    
##  [9] "vigorously"     "villainous"     "vindicating"    "vulnerable"    
## [13] "vicissitudes"   "vegetation"     "vulgarisms"     "valetudinarian"
## [17] "vindicated"     "vouchsafing"    "voluminous"     "vehemently"    
## [21] "valancourt"     "venerating"     "viscountess"    "vanquished"
# Earlier you defined a function ar1_multblocks() that takes a vector of row identifiers as argument and generates migration trajectories using the corresponding rows of the parameter set ar1est
# ar1_multblocks() depends on ar1_block(), which in turn depends on ar1_trajectory()
# These functions along with the cluster object cl of size 4, function show_migration(), the dataset ar1est (reduced to 200 rows) and packages parallel and ggplot2 are available in your workspace

ar1_block <- function(id, rate0 = 0.015, traj_len = 15, block_size = 10) {
    trajectories <- matrix(NA, nrow = block_size, ncol=traj_len)
    for (i in seq_len(block_size)) 
        trajectories[i,] <- ar1_trajectory(unlist(ar1est[id, ]), rate0 = rate0, len = traj_len)
    trajectories
}

ar1_trajectory <- function(est, rate0, len = 15) {
    ar1 <- function(est, r) {
        # simulate one AR(1) value
        est['mu'] + est['phi'] * (r - est['mu']) + 
        rnorm(1, sd = est['sigma'])
    }
    trajectory <- rep(NA, len)
    rate <- rate0
    for (time in seq_len(len)) {
        trajectory[time] <- ar1(est, r = rate)
        rate <- trajectory[time]
    }
    trajectory
}

ar1_multblocks <- function(ids, ...) {
    trajectories <- NULL
    for (i in seq_along(ids))
        trajectories <- rbind(trajectories, ar1_block(ids[i], ...))
    trajectories
}

# Export data and functions
clusterExport(cl, c("ar1est", "ar1_block", "ar1_trajectory"))

# Process ar1_multblocks in parallel
res <- clusterApply(cl, 1:nrow(ar1est), ar1_multblocks)

# Combine results into a matrix and show results
trajs <- do.call(rbind, res)
show_migration(trajs)

# The object res returned by clusterApply() in the previous exercise is also in your workspace, now called res_prev
res_prev <- res

# Split task into 5 chunks
ind <- splitIndices(nrow(ar1est), 5)

# Process ar1_multblocks in parallel
res <- clusterApply(cl, ind, ar1_multblocks)

# Dimensions of results 
(res_dim <- c(length(res), nrow(res[[1]])))
## [1]   5 400
(res_prev_dim <- c(length(res_prev), nrow(res_prev[[1]])))
## [1] 200  10
stopCluster(cl)

Chapter 3 - foreach, future.apply, and Load Balancing

foreach:

  • The looping construct can be applied using the foreach package (Calaway and Weston), similar to previous examples
  • The foreach makes it possible to run parallel processing for loops - code can be written the same way for both sequential and parallel applications
  • The basic syntax is foreach(...) %do% expression
    • library(foreach)
    • foreach(n = rep(5, 3)) %do% rnorm(n)
    • foreach(n = rep(5, 3), m = 10^(0:2)) %do% rnorm(n, mean = m) # can pass arguments
    • The foreach() call will return a value - in the example above, this would be a list
  • Can combine results using post-processing arguments - the .combine argument
    • foreach(n = rep(5, 3), .combine = rbind) %do% rnorm(n) # rbind the three vectors into a matrix
    • foreach(n = rep(5, 3), .combine = '+') %do% rnorm(n) # element-wise sum of the three vectors
  • Can also use list comprehensions with the foreach() function - using the %:% operator
    • foreach(x = sample(1:1000, 10), .combine = c) %:% when(x %% 3 == 0 || x %% 5 == 0) %do% x
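The %do% and .combine mechanics can be run entirely sequentially; a minimal sketch assuming only the foreach package (the toy squaring task is illustrative):

```r
library(foreach)

# Each iteration returns one value; .combine = c collapses the result list to a vector
squares <- foreach(i = 1:4, .combine = c) %do% i^2
# squares is c(1, 4, 9, 16)

# .combine = "+" folds the per-iteration results together instead
total <- foreach(i = 1:4, .combine = "+") %do% i^2
# total is 30

# %:% with when() filters the iteration values
div3 <- foreach(x = 1:10, .combine = c) %:% when(x %% 3 == 0) %do% x
# div3 contains 3, 6, 9
```

Because the loop body is an expression rather than a function, switching %do% to %dopar% (with a registered backend) parallelizes it without other changes.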

foreach and parallel backends:

  • The most popular backend is doParallel for parallel foreach() processing
    • Other backends include doFuture, built on the future package, and registerDoSEQ(), which switches foreach back to sequential processing
  • The doParallel package by Calaway et al is an interface between foreach and parallel, and requires initialization with registerDoParallel()
    • library(doParallel)
    • registerDoParallel(cores = 3) # uses multicore for Unix and snow for Windows
    • cl <- makeCluster(3) # can make your own clusters and pass them also
    • registerDoParallel(cl) # passing the cl object rather than cores (will default to snow, since that is the makeCluster() default)
  • Examples of converting a sequential loop to a parallel loop
    • library(foreach)
    • foreach(n = rep(5, 3)) %do% rnorm(n)
    • library(doParallel)
    • cl <- makeCluster(3)
    • registerDoParallel(cl)
    • foreach(n = rep(5, 3)) %dopar% rnorm(n) # note the conversion to %dopar% which is what engages the parallel processing
  • The doFuture package (Bengtsson) is built on top of the future package
    • The central idea is that there is a future plan for how foreach should work behind the scenes - sequential, cluster, multicore, multiprocess, etc.
    • Additional backends (e.g. for HPC job schedulers) are available via future.batchtools
    • library(doFuture)
    • registerDoFuture()
    • plan(cluster, workers = 3) # using the cluster plan
    • foreach(n = rep(5, 3)) %dopar% rnorm(n) # same use of foreach, with the cluster plan applied
    • plan(multicore) # using the multicore plan
    • foreach(n = rep(5, 3)) %dopar% rnorm(n)

future and future.apply - packages under active development:

  • The future package intends to have a uniform API for sequential and parallel processing
  • The future construct is for an expression that may be used in the future
    • Example in plain R - x <- mean(rnorm(n, 0, 1)) ; y <- mean(rnorm(n, 10, 5)) ; print(c(x, y))
    • Example in implicit futures - x %<-% mean(rnorm(n, 0, 1)) ; y %<-% mean(rnorm(n, 10, 5)) ; print(c(x, y))
    • Example in explicit futures - x <- future(mean(rnorm(n, 0, 1))) ; y <- future(mean(rnorm(n, 10, 5))) ; print(c(value(x), value(y)))
    • Values can be managed asynchronously, meaning that the next line can start running while the current line is still in process
  • The same code can then be run either in parallel or sequentially - sequential example
    • plan(sequential)
    • x %<-% mean(rnorm(n, 0, 1))
    • y %<-% mean(rnorm(n, 10, 5))
    • print(c(x, y))
  • The same code can then be run either in parallel or sequentially - parallel example
    • plan(multicore)
    • x %<-% mean(rnorm(n, 0, 1))
    • y %<-% mean(rnorm(n, 10, 5))
    • print(c(x, y))
  • The package future.apply is a higher level API for all the apply packages in R using futures
    • Sibling to foreach
    • functions include future_lapply(), future_sapply(), future_apply()
  • Example of using future.apply()
    • lapply(1:10, rnorm) # base R
    • plan(sequential)
    • future_lapply(1:10, rnorm) # future.apply implementation of base R above
    • plan(cluster, workers=4)
    • future_lapply(1:10, rnorm) # future.apply implementation of base R above, now run in parallel
  • Separating the plan from the processing allows for processing across many systems
    • Single CPU uses sequential, cluster plan for many cores, etc.
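The plan/processing separation is easy to demonstrate; a minimal sketch assuming the future and future.apply packages (multisession is used here as a portable parallel plan, since multicore is unavailable on Windows):

```r
library(future)
library(future.apply)

slow_square <- function(x) {
    Sys.sleep(0.01)  # stand-in for real work
    x^2
}

# Identical processing code under two different plans
plan(sequential)
res_seq <- future_sapply(1:5, slow_square)

plan(multisession, workers = 2)
res_par <- future_sapply(1:5, slow_square)

plan(sequential)  # reset

identical(res_seq, res_par)  # TRUE - both are c(1, 4, 9, 16, 25)
```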

Load balancing and scheduling:

  • There can be significant node waiting times if the master waits for all workers to finish before assigning new tasks
    • The clusterApplyLB() is designed to speed up processing by sending new tasks to workers as soon as they finish their old tasks
  • Communication overhead can also be a big problem when the tasks are small
    • Can instead give the workers many tasks at once, and have them communicate with the master only at the start and the finish
    • Drawback is that idle workers will not be available to help busy workers
  • Can chunk in parallel using the splitIndices() function
    • splitIndices(10, 2)
    • clusterApply(cl, x = splitIndices(10, 2), fun = sapply, "*", 100) # multiply all numbers by 100
    • Can also use parApply with the chunk.size option in R 3.5+
    • foreach(s = isplitVector(1:10, chunks = 2)) %dopar% sapply(s, "*", 100) # itertools functions to implement chunking in foreach
    • future_sapply(1:10, "*", 100, future.scheduling = 1) # 1 chunk per worker
    • future_sapply(1:10, "*", 100, future.scheduling = FALSE) # 1 chunk per task
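splitIndices() itself is deterministic and can be inspected without starting a cluster; a minimal base-parallel sketch (the chunked multiply-by-100 task is illustrative):

```r
library(parallel)

# Split 10 task indices into 2 roughly equal, ordered chunks
chunks <- splitIndices(10, 2)
# chunks is a 2-element list covering 1:10 in order, e.g. list(1:5, 6:10)

# Chunked processing: each worker would receive one whole chunk
res <- lapply(chunks, function(ix) sapply(ix, "*", 100))
unlist(res)
# 100 200 300 ... 1000
```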

Example code includes:

# Recall the first chapter where you found the most frequent words from the janeaustenr package that are of certain minimum length
result <- lapply(letters, max_frequency, words = words, min_length = 5) %>% 
    unlist
# In this exercise, you will implement the foreach construct to solve the same problem
# The janeaustenr package, a vector of all words from the included books, words, and a function max_frequency() for finding the results based on a given starting letter are all available in your workspace

# Load the package
library(foreach)
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
# foreach construct
result <- foreach(l = letters, .combine=c) %do% max_frequency(l, words=words, min_length=5)
                
# Plot results 
barplot(result, las = 2)

# Specifically, your job is to modify the code so that the maximum frequency for the first half of the alphabet is obtained for words that are two and more characters long, while the frequency corresponding to the second half of the alphabet is derived from words that are six and more characters long
# Note that we are using an alphabet of 26 characters

# foreach construct and combine into vector
result <- foreach(l = letters, n = rep(c(2, 6), each=13), .combine = c) %do% 
    max_frequency(l, words=words, min_length=n)
          
# Plot results
barplot(result, las = 2)

# Register doParallel with 2 cores
doParallel::registerDoParallel(cores=2)

# Parallel foreach loop
res <- foreach(r = rep(1000, 100), .combine = rbind, 
            .packages = "extraDistr") %dopar% myrdnorm(r)
            
# Dimensions of res
dim_res <- dim(res)


# So far you learned how to search for the most frequent word in a text sequentially using foreach()
# In the course of the next two exercises, you will implement the same task using doParallel and doFuture for parallel processing and benchmark it against the sequential version
# The sequential solution is implemented in function freq_seq() (type freq_seq in your console to see it)
# It iterates over a global character vector chars and calls the function max_frequency() which searches within a vector of words, while filtering for minimum word length
# All these objects are preloaded, as is the doParallel package
# Your job now is to write a function freq_doPar() that runs the same code in parallel via doParallel

freq_seq <- function(min_length = 5)
    foreach(l = letters, .combine = c) %do% 
        max_frequency(l, words = words, min_length = min_length)

# Function for doParallel foreach
freq_doPar <- function(cores, min_length = 5) {
    # Register a cluster of size cores
    doParallel::registerDoParallel(cores=cores)
    
    # foreach loop
    foreach(l=letters, .combine=c, 
            .export = c("max_frequency", "select_words", "words"),
            .packages = c("janeaustenr", "stringr")) %dopar%
        max_frequency(l, words=words, min_length=min_length)
}

# Run on 2 cores
freq_doPar(cores=2)
##     again     being     could     darcy     every     first     great 
##      1001      1445      3613       373      1456       972       981 
##   herself    indeed  jennings knightley    little     might     never 
##      1360       664       199       356      1295      1369      1362 
##     other     place     quite    really    should     there     under 
##      1084       503       870       504      1541      2209       293 
##     visit     would     xviii     young   zealous 
##       294      3238         4       766         5
# Now your job is to create a function freq_doFut() that accomplishes the same task as freq_doPar() but with the doFuture backend
# Note that when using doFuture, arguments .packages and .export in foreach() are not necessary, as the package deals with the exports automatically
# You will then benchmark these two functions, together with the sequential freq_seq()
# All the functions from the last exercise are available in your workspace
# In addition, the packages doFuture and microbenchmark are also preloaded
# To keep the computation time low, the global chars vector is set to the first six letters of the alphabet only

cores <- 2
min_length <- 5

# Error in tweak.function(strategy, ..., penvir = penvir) : 
# Trying to use non-future function 'survival::cluster': function (x)  { ... }
# For solution see https://github.com/HenrikBengtsson/future/issues/152

# Function for doFuture foreach
freq_doFut <- function(cores, min_length = 5) {
    # Register and set plan
    doFuture::registerDoFuture()
    future::plan(future::cluster, workers=cores)

    # foreach loop
    foreach(l = letters, .combine = c) %dopar%
        max_frequency(l, words = words, min_length = min_length)
}

# Benchmark
microbenchmark(freq_seq(min_length),
               freq_doPar(cores, min_length),
               freq_doFut(cores, min_length),
               times = 1)
## Unit: seconds
##                           expr       min        lq      mean    median
##           freq_seq(min_length)  7.244515  7.244515  7.244515  7.244515
##  freq_doPar(cores, min_length)  9.763635  9.763635  9.763635  9.763635
##  freq_doFut(cores, min_length) 11.379522 11.379522 11.379522 11.379522
##         uq       max neval
##   7.244515  7.244515     1
##   9.763635  9.763635     1
##  11.379522 11.379522     1
# It is straightforward to swap parallel backends with foreach
# In this small example, you might not see any time advantage from running it in parallel
# In addition, doFuture is usually somewhat slower than doParallel
# This is because doFuture has a higher computation overhead
# We encourage you to test these frameworks on more time-consuming applications, where the overhead becomes negligible relative to the overall processing time

extract_words_from_text <- function(text) {
    str_extract_all(text, boundary("word")) %>% 
        unlist %>% 
        tolower
}

# Main function
freq_fapply <- function(words, chars=letters, min_length=5) {
    unlist(future.apply::future_lapply(chars, FUN=max_frequency, words = words, min_length = min_length))
}

obama <- readLines("./RInputFiles/obama.txt")
obama_speech <- paste(obama[obama != ""], collapse=" ")

# Extract words and call freq_fapply
words <- extract_words_from_text(obama_speech)
res <- freq_fapply(words)

# Plot results
barplot(res, las = 2)

# Now imagine you are a user of the fictional package from the previous exercise
# At home you have a two-CPU Mac computer, and at work you use a Linux cluster with two 16-CPU computers, called "oisin" and "oscar"
# Your job is to write a function for each hardware setup that calls freq_fapply() while taking advantage of all available CPUs
# For the cluster, you set workers to a vector of computer names corresponding to the number of CPUs, i.e. 16 x "oisin" and 16 x "oscar"
# For a one-CPU environment, we have created a function fapply_seq()

# fapply_seq <- function(...) {
#     future::plan(strategy="sequential") 
#     freq_fapply(words, letters, ...)
# }

# multicore function
# fapply_mc <- function(cores=2, ...) {
#     plan(strategy="multicore", workers=cores)
#     freq_fapply(words, letters, ...)
# }

# cluster function
# fapply_cl <- function(cores=NULL, ...) {
#     # set default value for cores
#     if(is.null(cores))
#         cores <- rep(c("oisin", "oscar"), each = 16)
#         
#     # parallel processing
#     plan(strategy="cluster", workers=cores)
#     freq_fapply(words, letters, ...)
# }


# Note: Multicore does not work on Windows. We recommend using the 'multiprocess' or 'cluster' plan on Windows computers.

# Microbenchmark
# microbenchmark(fapply_seq = fapply_seq(),
#                fapply_mc_2 = fapply_mc(cores=2), 
#                fapply_mc_10 = fapply_mc(cores=10),
#                fapply_cl = fapply_cl(cores=2), 
#                times = 1)

# Which is the slowest?
# slowest1 <- "fapply_cl"


# This is because, for a small number of tasks, sequential code can run faster than a parallel version due to the parallel overhead
# The cluster plan usually has the largest overhead and thus should be used only for larger numbers of tasks
# The multicore plan may be more efficient when the number of workers is equal to the number of cores
# It uses shared memory, and thus is faster than cluster


# In your workspace there is a vector tasktime containing simulated processing times of 30 tasks (generated using runif())
# There is also a cluster object cl with two nodes
# Your job is to apply the function Sys.sleep() to tasktime in parallel using clusterApply() and clusterApplyLB() and benchmark them
# The parallel and microbenchmark packages are loaded
# We also provided functions for plotting cluster usage plots called plot_cluster_apply() and plot_cluster_applyLB()
# Both functions use functionality from the snow package

tasktime <- c(0.1328, 0.1861, 0.2865, 0.4541, 0.1009, 0.4492, 0.4723, 0.3304, 0.3146, 0.031, 0.1031, 0.0884, 0.3435, 0.1921, 0.3849, 0.2489, 0.3588, 0.496, 0.1901, 0.3887, 0.4674, 0.1062, 0.3259, 0.0629, 0.1337, 0.1931, 0.0068, 0.1913, 0.4349, 0.1702)

# plot_cluster_apply <- function(cl, x, fun) 
#     plot(snow::snow.time(snow::clusterApply(cl, x, fun)),
#             title = "Cluster usage of clusterApply")

# plot_cluster_applyLB <- function(cl, x, fun) 
#     plot(snow::snow.time(snow::clusterApplyLB(cl, x, fun)),
#             title = "Cluster usage of clusterApplyLB")

# Benchmark clusterApply and clusterApplyLB
# microbenchmark(
#     clusterApply(cl, tasktime, Sys.sleep),
#     clusterApplyLB(cl, tasktime, Sys.sleep),
#     times = 1
# )

# Plot cluster usage
# plot_cluster_apply(cl, tasktime, Sys.sleep)
# plot_cluster_applyLB(cl, tasktime, Sys.sleep)


# Now we compare the results from the previous exercise with ones generated using parSapply(), which represents here an implementation that groups tasks into as many chunks as there are workers available
# You first explore its cluster usage plot, using the function plot_parSapply() we defined for you
# We generated a version of the tasktime vector, called bias_tasktime that generates very uneven load
# Your job is to compare the run times of parSapply() with clusterApplyLB() applied to bias_tasktime

# plot_parSapply <- function(cl, x, fun) 
#     plot(snow::snow.time(snow::parSapply(cl, x, fun)),
#             title = "Cluster usage of parSapply")

# bias_tasktime <- c(1, 1, 1, 0.1, 0.1, 0.1, 1e-04, 1e-04, 1e-04, 0.001, 1)

# Plot cluster usage for parSapply
# plot_parSapply(cl, tasktime, Sys.sleep)

# Microbenchmark
# microbenchmark(
#     clusterApplyLB(cl, bias_tasktime, Sys.sleep),
#     parSapply(cl, bias_tasktime, Sys.sleep),
#     times = 1
# )

# Plot cluster usage for parSapply and clusterApplyLB
# plot_cluster_applyLB(cl, bias_tasktime, Sys.sleep)
# plot_parSapply(cl, bias_tasktime, Sys.sleep)
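
The chunking that parSapply() performs can be sketched without a cluster: split the task vector into as many chunks as there are workers, so each worker receives one large task instead of many tiny ones. A minimal illustration (variable names are made up):

```r
# Group 30 tasks into as many chunks as there are workers
tasks <- 1:30
n_workers <- 2
chunk_id <- cut(seq_along(tasks), breaks = n_workers, labels = FALSE)
chunks <- split(tasks, chunk_id)

lengths(chunks)  # two chunks of 15 tasks each
```

With only one chunk per worker there is a single round of master-worker communication, which is why parSapply() has lower overhead than clusterApply() for many small tasks, at the cost of load balancing.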

Chapter 4 - Random Numbers and Reproducibility

Are results reproducible?

  • Many statistical methods require the use of random numbers - MCMC, boot, simulation, sample(), rnorm(), etc.
    • Typically, would use a set.seed() to initialize the RNG to a known state
  • Setting the seed typically does not guarantee reproducibility of parallel processing
    • library(parallel)
    • cl <- makeCluster(2)
    • set.seed(1234)
    • clusterApply(cl, rep(3, 2), rnorm)
    • set.seed(1234) # only sets the RNG in the master node; thus not sent to the workers
    • clusterApply(cl, rep(3, 2), rnorm) # will get different sets of results
  • Can instead set the RNG for each of the workers
    • clusterEvalQ(cl, set.seed(1234))
    • clusterApply(cl, rep(3, 2), rnorm) # both workers will give identical results
  • There is another common but not recommended method of generating random numbers in parallel code - it gives undesirable statistical properties
    • for (i in 1:2) {
    • set.seed(1234)
    • clusterApply(cl, sample(1:10000000, 2), set.seed)
    • print(clusterApply(cl, rep(3, 2), rnorm))
    • }

Parallel random number generators:

  • A good RNG should satisfy certain criteria - long period (> 2^100), good structural properties in high dimensions, reproducibility
    • These parameters should hold in a distributed environment also
  • RNG streams with period 2^191 and seeds 2^127 steps apart are available, based on research by L’Ecuyer
    • Allows for each part of a parallel process to have reproducible streams with the proper properties
    • Available in R through rlecuyer and rstream
    • Also available in base R using RNGkind("L'Ecuyer-CMRG")
  • The L’Ecuyer generator is the default when using parallel but needs to be initialized with a seed for cluster cl
    • clusterSetRNGStream(cl, iseed = 1234) # initializes a reproducible and independent seed for each of the workers
  • Reproducibility in parallel depends on the task flow - certain conditions need to apply
    • Runs on clusters of the same size (different number of workers means different random numbers)
    • Cannot use load balancing (balancing loads means non-deterministic scheduling and assignment)
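
The stream mechanics can be seen without spawning workers: once RNGkind("L'Ecuyer-CMRG") is active, parallel::nextRNGStream() jumps a seed 2^127 steps ahead, and restoring a saved seed reproduces the same draws. A minimal single-process sketch:

```r
library(parallel)

# Switch to the L'Ecuyer-CMRG generator used by clusterSetRNGStream()
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)

s1 <- .Random.seed          # seed of stream 1
s2 <- nextRNGStream(s1)     # stream 2 starts 2^127 steps ahead

.Random.seed <- s1
x <- rnorm(3)               # draw from stream 1

.Random.seed <- s1
y <- rnorm(3)               # restore stream 1: identical draws

.Random.seed <- s2
z <- rnorm(3)               # stream 2: different, independent draws

c(identical(x, y), identical(x, z))  # TRUE FALSE
```

clusterSetRNGStream() does essentially this on the workers: it assigns one such stream per worker, which is why results reproduce only when the cluster size stays the same.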

Reproducibility in foreach and future.apply:

  • Can achieve reproducibility in foreach using doRNG - one stream per task
    • library(doRNG)
    • library(doParallel)
    • registerDoParallel(cores = 3)
    • set.seed(1)
    • res1 <- foreach(n = rep(2, 5), .combine = rbind) %dorng% rnorm(n)
    • set.seed(1)
    • res2 <- foreach(n = rep(2, 5), .combine = rbind) %dorng% rnorm(n)
    • identical(res1, res2) # TRUE
  • Can also use doRNG by way of %dopar%
    • library(doRNG)
    • library(doParallel)
    • registerDoParallel(cores = 3)
    • registerDoRNG(1) # 1 is the seed
    • res3 <- foreach(n = rep(2, 5), .combine = rbind) %dopar% rnorm(n)
    • set.seed(1)
    • res4 <- foreach(n = rep(2, 5), .combine = rbind) %dopar% rnorm(n)
    • c(identical(res1, res3), identical(res2, res4)) # TRUE TRUE
  • Can also use independent streams in future.apply
    • library(future.apply)
    • plan(sequential)
    • res5 <- future_lapply(1:5, FUN = rnorm, future.seed = 1234)
    • plan(multiprocess)
    • res6 <- future_lapply(1:5, FUN = rnorm, future.seed = 1234)
    • identical(res5, res6) # TRUE

Next steps:

  • The parallel package is the baseline for many other packages
    • Often not reproducible
  • The foreach package with doParallel and doFuture can be reproducible using doRNG
  • The future.apply package has an intuitive apply-like syntax and is always reproducible if future.seed is set
  • General best practices include
    • Minimize overhead and master-worker communication frequencies
    • Use scheduling and load balancing through effective use of chunking
    • Be aware that parallel overhead can actually make a job slower if the tasks are all small and the communication takes more time than the calculations

Example code includes:

# In addition to the code in the previous exercise, we also created a FORK cluster for you.

# cl.fork <- makeCluster(2, type = "FORK")

# Your job is to register the two cluster objects with the preloaded doParallel package and compare results obtained with parallel foreach
# How do the results differ in terms of reproducibility?

library(doParallel)
## Loading required package: iterators
cl.sock <- makeCluster(2, type = "PSOCK")
registerDoParallel(cl.sock)
set.seed(100)
foreach (i = 1:2) %dopar% rnorm(3)
## [[1]]
## [1]  1.2266717  0.8355303 -0.4718668
## 
## [[2]]
## [1]  0.4870642  1.3446289 -0.5212480
# Register and use cl.sock
registerDoParallel(cl.sock)
replicate(2, {
    set.seed(100)
    foreach(i = 1:2, .combine = rbind) %dopar% rnorm(3)
    }, simplify = FALSE
)
## [[1]]
##                [,1]       [,2]       [,3]
## result.1 -0.5379697 -0.2982148 -0.6482285
## result.2 -1.4116888  0.9865273  1.6457373
## 
## [[2]]
##               [,1]      [,2]        [,3]
## result.1 0.6753372 1.2918483 -2.10406311
## result.2 0.8411086 0.5899088 -0.05641572
# Register and use cl.fork
# registerDoParallel(cl.fork)
# replicate(2, {
#     set.seed(100)
#     foreach(i = 1:2, .combine = rbind) %dopar% rnorm(3)
#     }, simplify = FALSE
# )


# Create a cluster
cl <- makeCluster(2)

# Check RNGkind on workers
clusterCall(cl, RNGkind)
## [[1]]
## [1] "Mersenne-Twister" "Inversion"       
## 
## [[2]]
## [1] "Mersenne-Twister" "Inversion"
# Set the RNG seed on workers
clusterSetRNGStream(cl, iseed=100)

# Check RNGkind on workers
clusterCall(cl, RNGkind)
## [[1]]
## [1] "L'Ecuyer-CMRG" "Inversion"    
## 
## [[2]]
## [1] "L'Ecuyer-CMRG" "Inversion"
# Now you are ready to make your results reproducible
# You will use the simple embarrassingly parallel application for computing a mean of random numbers (myfunc) which we parallelized in the second chapter using clusterApply()
# The parallel package, myfunc() , n (sample size, set to 1000) and repl (number of replicates, set to 5) are available in your workspace
# You will now call clusterApply() repeatedly to check if results can be reproduced, without and with initializing the RNG

n <- 1000
repl <- 5

# Create a cluster of size 2
cl <- makeCluster(2)

# Call clusterApply three times
for(i in 1:3)
    print(unlist(clusterApply(cl, rep(n, repl), myfunc)))
## [1] -0.023711123 -0.036812662  0.004440612 -0.022992896  0.028865729
## [1] -0.039213404 -0.007297949 -0.047685251  0.054481415  0.042652353
## [1]  0.004765250  0.038449842 -0.033179686  0.046234370 -0.009752082
# Create a seed object
seed <- 1234

# Repeatedly set the cluster seed and call clusterApply()
for(i in 1:3) {
    clusterSetRNGStream(cl, iseed = seed)
    print(unlist(clusterApply(cl, rep(n, repl), myfunc)))
}
## [1] -0.008597904 -0.006089337 -0.013980519 -0.066293388  0.004297755
## [1] -0.008597904 -0.006089337 -0.013980519 -0.066293388  0.004297755
## [1] -0.008597904 -0.006089337 -0.013980519 -0.066293388  0.004297755
# Create two cluster objects, of size 2 and 4
cl2 <- makeCluster(2)
cl4 <- makeCluster(4)

# Set seed on cl2 and call clusterApply
clusterSetRNGStream(cl2, iseed = seed)
unlist(clusterApply(cl2, rep(n, repl), myfunc))
## [1] -0.008597904 -0.006089337 -0.013980519 -0.066293388  0.004297755
# Set seed on cl4 and call clusterApply
clusterSetRNGStream(cl4, iseed = seed)
unlist(clusterApply(cl4, rep(n, repl), myfunc))
## [1] -0.008597904 -0.006089337  0.077876985 -0.072012937 -0.013980519
# Register doParallel and doRNG
library(doRNG)
## Loading required package: rngtools
## Loading required package: pkgmaker
## Loading required package: registry
## 
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:base':
## 
##     isFALSE
registerDoParallel(cores = 2)
doRNG::registerDoRNG(seed)

# Call ar1_block via foreach
mpar <- foreach(r=1:5) %dopar% ar1_block(r)

# Register sequential backend, set seed and run foreach
registerDoSEQ()
set.seed(seed)
mseq <- foreach(r=1:5) %dorng% ar1_block(r)

# Check if results identical
identical(mpar, mseq)
## [1] TRUE
# You are able to reproduce sequential and parallel applications! Remember to always use %dorng% if you use the doSEQ backend
# Also note that by default on the Linux DataCamp server, registerDoParallel() creates a FORK cluster if a number of cores is passed to it
# As a result, there was no need to export any functions to workers, as they were copied from the master
# On a different platform, the .export option may be needed


# Set multiprocess plan 
future::plan(strategy="multiprocess", workers = 2)

# Call ar1_block via future_lapply
mfpar <- future.apply::future_lapply(1:5, FUN=ar1_block, future.seed=seed)

# Set sequential plan and repeat future_lapply
future::plan(strategy="sequential")
mfseq <- future.apply::future_lapply(1:5, FUN=ar1_block, future.seed=seed)

# Check if results are identical
identical(mfpar, mfseq)
## [1] TRUE
rm(mean)
rm(sd)

Marketing Analytics in R: Choice Modeling

Chapter 1 - Quickstart Guide

Why choice?

  • Choice modeling (and conjoint) is a common and popular tool used in marketing
    • Linear regression is about predicting a number based on features
    • Frequently, though, we want to choose from a selection of alternatives (picking a show, purchasing a car, etc.)
  • Multinomial logit (logistic regressions) work well with choice data
  • Choice models can be helpful for feature selection, pricing, trade-offs between quality/cost, etc.

Inspecting choice data:

  • Choice data does not fit into normal predictive modeling formats
    • For regression, data are typically one row per observation
    • For choice datasets, data are typically stacked into groups of rows, where each row describes an alternative (rather than an observation), with a flag for which option was chosen
  • May want to count up the number of choices made (total or proportions)
    • xtabs(choice ~ price, data=sportscar)

Fitting and interpreting a choice model:

  • Fitting choice models is similar to fitting linear models
    • my_model <- lm(y ~ x1 + x2 + x3, data=lm_data)
    • library(mlogit) # multinomial logit is needed rather than GLM
    • mymodel <- mlogit(choice ~ feature1 + feature2 + feature3, data = choice_data) # data must be choice data, including ques, alt, and choice columns
    • summary(mymodel)
  • Coefficients with magnitude greater than 1 have a very large impact on the decision (rule of thumb)

Using choice models to make decisions:

  • Can use the choice models to predict market shares
    • predict_mnl(model, products) # for mlogit models - written by instructor

Example code includes:

# Unload conflicting namespaces
unloadNamespace("rms")
unloadNamespace("quantreg")
unloadNamespace("MatrixModels")

unloadNamespace("lmerTest")
unloadNamespace("semPlot")
unloadNamespace("rockchalk")
unloadNamespace("qgraph")
unloadNamespace("sem")
unloadNamespace("mi")
unloadNamespace("arm")
unloadNamespace("mice")
unloadNamespace("mitml")
unloadNamespace("jomo")
unloadNamespace("arm")
unloadNamespace("jomo")
unloadNamespace("lme4")


# load the mlogit library 
library(mlogit)


scLong <- read.csv("./RInputFiles/sportscar_choice_long.csv")
scWide <- read.csv("./RInputFiles/sportscar_choice_wide.csv")
sportscar <- scLong
sportscar$alt <- as.factor(sportscar$alt)
sportscar$seat <- as.factor(sportscar$seat)
sportscar$price <- as.factor(sportscar$price)
sportscar$choice <- as.logical(sportscar$choice)
sportscar <- sportscar %>% rename(resp.id=resp_id)
sportscar$key <- rep(1:2000, each=3)
row.names(sportscar) <- paste(sportscar$key, sportscar$alt, sep=".")
sportscar <- mlogit.data(sportscar, shape="long", choice="choice", alt.var="alt")
str(sportscar)

# Create a table of chosen sportscars by transmission type
chosen_by_trans <- xtabs(choice ~ trans, data = sportscar)

# Print the chosen_by_trans table to the console
chosen_by_trans

# Plot the chosen_by_price object
barplot(chosen_by_trans)


# Crashes out due to issue with class "family" in MatrixModels and lme4
m1 <- mlogit(choice ~ seat + trans + convert + price, data=sportscar, seed=10)

# fit a choice model using mlogit() and assign the output to m1
# m1 <- mlogit::mlogit(choice ~ seat + trans + convert + price, 
#                      data=sportscar, 
#                      chid.var="key", 
#                      alt.var="alt", 
#                      choice="choice", 
#                      seed=10
#                      )

# summarize the m1 object to see the output of the choice model
summary(m1)


predict_mnl <- function(model, products) {
  # model: mlogit object returned by mlogit()
  # data: a data frame containing the set of designs for which you want to 
  #       predict shares.  Same format as the data used to estimate model. 
  data.model <- model.matrix(update(model$formula, 0 ~ .), data = products)[,-1]
  utility <- data.model%*%model$coef
  share <- exp(utility)/sum(exp(utility))
  cbind(share, products)
}

# inspect products
products <- data.frame(seat=factor("2", levels=c("2", "4", "5")), 
                       trans=factor(rep(c("manual", "auto"), each=2), levels=c("auto", "manual")), 
                       convert=factor(rep(c("no", "yes"), times=2), levels=c("no", "yes")), 
                       price=factor("35", levels=c("30", "35", "40"))
                       )
str(products)

# use predict_mnl to predict share for products
shares <- predict_mnl(m1, products)

# print the shares to the console
shares


barplot(shares$share, ylab="Predicted Market Share", 
        names.arg=c("Our Car", "Comp 1", "Comp 2", "Comp 3"))

Chapter 2 - Managing and Summarizing Choice Data

Assembling choice data:

  • Choices made in the wild (revealed preference data) can be analyzed using the transaction record and product set available
  • Can instead run a survey with hypothetical decision making (conjoint)
  • Sometimes, data are provided in wide format, with sets of columns describing the choices

Converting from wide to long:

  • Often helpful to convert the wide format data to long format instead
    • sportscar <- reshape(sportscar_wide, direction="long", varying = list(seat=5:7, trans=8:10, convert=11:13, price=14:16), v.names = c("seat", "trans", "convert", "price"), timevar="alt") # column labels are given in v.names; timevar="alt" means make an alt column
    • new_order <- order(sportscar$resp_id, sportscar$ques, sportscar$alt)
    • sportscar <- sportscar[new_order,] # ordered by question
    • sportscar$choice <- sportscar$choice == sportscar$alt # make a boolean rather than an integer
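
A minimal self-contained illustration of the same reshape() mechanics, using a made-up two-question survey (the column names here are hypothetical):

```r
# Wide: one row per question, one column per alternative's price
wide <- data.frame(resp_id = c(1, 1), ques = c(1, 2),
                   price1  = c(30, 35), price2 = c(40, 30),
                   choice  = c(2, 1))

# Long: one row per alternative; timevar = "alt" records which column it came from
long <- reshape(wide, direction = "long",
                varying = list(price = c("price1", "price2")),
                v.names = "price", timevar = "alt")

long <- long[order(long$resp_id, long$ques, long$alt), ]
long$choice <- long$choice == long$alt   # flag the chosen alternative

nrow(long)        # 4 rows: 2 questions x 2 alternatives
sum(long$choice)  # 2: exactly one chosen alternative per question
```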

Choice data in two files:

  • Can receive choice data from two separate files - alternatives and choices
    • sportscar <- merge(sportscar_choices, sportscar_alts, by=c("resp_id", "ques"))

Visualizing choice data:

  • Data in long format can be summarized and visualized
    • xtabs(~ trans, data = sportscar) # just get the totals by transmission
    • xtabs(~ trans + choice, data = sportscar) # COUNT of transmission by choice
    • xtabs(choice ~ trans, data=sportscar) # SUM of choice by trans
    • plot(xtabs(~ trans + choice, data=sportscar)) # mosaic plot
    • plot(xtabs(~ trans + segment + choice, data=sportscar)) # mosaic plot split primarily by trans, then with segment and choice
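
The xtabs() variants above differ in whether the left-hand side of the formula is empty (counts of rows) or a 0/1 variable (sums). A toy example with made-up data:

```r
toy <- data.frame(trans  = rep(c("auto", "manual"), each = 3),
                  choice = c(1, 0, 1, 0, 0, 1))

counts <- xtabs(~ trans, data = toy)        # COUNT of rows per transmission
chosen <- xtabs(choice ~ trans, data = toy) # SUM of choices per transmission

counts  # auto 3, manual 3
chosen  # auto 2, manual 1
```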

Designing a conjoint survey:

  • Conjoint surveys are popular for product design - can include any number of features
    • Begin by picking attributes of interest (commonly 8-10) and levels of interest (commonly 2-5 per attribute)
  • Need to decide which product profiles to show to which users and which questions
  • Can create a random design in R
    • attribs <- list(Type=c("Milk", "Dark", "White"), Brand=c("Cadbury", "Toblerone", "Kinder"), Price=5:30/10)
    • all_comb <- expand.grid(attribs)
    • for (i in 1:100) {
    • rand_rows <- sample(1:nrow(all_comb), size=12*3)
    • rand_alts <- all_comb[rand_rows, ]
    • choc_survey[choc_survey$Subject==i, 4:6] <- rand_alts
    • }
  • Can code the survey in html or use a platform like Google or Survey Monkey
    • Can also use a firm like Qualtrics
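
The design loop above fills in a pre-built choc_survey data frame that is not shown here; a self-contained sketch of the random-design idea for a single respondent (12 questions of 3 alternatives each) might look like:

```r
attribs <- list(Type  = c("Milk", "Dark", "White"),
                Brand = c("Cadbury", "Toblerone", "Kinder"),
                Price = 5:30 / 10)

# All possible profiles, then a random draw of 12 questions x 3 alternatives
all_comb  <- expand.grid(attribs)
rand_rows <- sample(nrow(all_comb), size = 12 * 3)
design    <- all_comb[rand_rows, ]
design$ques <- rep(1:12, each = 3)
design$alt  <- rep(1:3, times = 12)

nrow(all_comb)  # 3 * 3 * 26 = 234 candidate profiles
nrow(design)    # 36 rows shown to the respondent
```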

Example code includes:

chLong <- read.csv("./RInputFiles/chocolate_choice_long.csv")
chWide <- read.csv("./RInputFiles/chocolate_choice_wide.csv")
chocolate_wide <- chWide

# Look at the head() of chocolate_wide
head(chocolate_wide)
##   Subject Trial    Brand1 Brand2      Brand3 Price1 Price2 Price3
## 1    2401     1      Dove Godiva        Dove    0.6    0.7    3.6
## 2    2401     2    Godiva Godiva   Hershey's    2.7    3.9    0.7
## 3    2401     3 Hershey's Godiva   Hershey's    1.7    3.7    3.0
## 4    2401     4     Lindt  Lindt Ghirardelli    1.0    3.6    0.5
## 5    2401     5 Hershey's Godiva   Hershey's    0.8    1.5    3.3
## 6    2401     6     Lindt   Dove      Godiva    3.1    2.5    2.6
##          Type1        Type2        Type3 Selection Response_Time
## 1         Milk         Dark        White         1          5210
## 2 Milk w/ Nuts         Dark Milk w/ Nuts         2          7480
## 3 Dark w/ Nuts         Dark         Dark         2          7704
## 4         Milk Milk w/ Nuts Dark w/ Nuts         1          5774
## 5 Milk w/ Nuts         Dark        White         2          5238
## 6         Milk        White         Dark         3          3423
# Use summary() to see which brands and types are in chocolate_wide
summary(chocolate_wide)
##     Subject         Trial            Brand1           Brand2  
##  Min.   :2401   Min.   : 1   Dove       :60   Dove       :85  
##  1st Qu.:2405   1st Qu.: 7   Ghirardelli:58   Ghirardelli:67  
##  Median :2410   Median :13   Godiva     :83   Godiva     :74  
##  Mean   :2409   Mean   :13   Hershey's  :63   Hershey's  :66  
##  3rd Qu.:2413   3rd Qu.:19   Lindt      :86   Lindt      :58  
##  Max.   :2417   Max.   :25                                    
##          Brand3       Price1          Price2          Price3     
##  Dove       :69   Min.   :0.500   Min.   :0.500   Min.   :0.500  
##  Ghirardelli:61   1st Qu.:1.100   1st Qu.:1.300   1st Qu.:1.300  
##  Godiva     :78   Median :2.200   Median :2.400   Median :2.200  
##  Hershey's  :78   Mean   :2.144   Mean   :2.255   Mean   :2.233  
##  Lindt      :64   3rd Qu.:3.100   3rd Qu.:3.200   3rd Qu.:3.100  
##                   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##           Type1             Type2             Type3      Selection    
##  Dark        :63   Dark        :95   Dark        :75   Min.   :1.000  
##  Dark w/ Nuts:70   Dark w/ Nuts:68   Dark w/ Nuts:75   1st Qu.:1.000  
##  Milk        :75   Milk        :55   Milk        :60   Median :2.000  
##  Milk w/ Nuts:83   Milk w/ Nuts:55   Milk w/ Nuts:67   Mean   :1.926  
##  White       :59   White       :77   White       :73   3rd Qu.:3.000  
##                                                        Max.   :3.000  
##  Response_Time  
##  Min.   : 1021  
##  1st Qu.: 2750  
##  Median : 3878  
##  Mean   : 4713  
##  3rd Qu.: 5766  
##  Max.   :24462
# use reshape() to change the data from long to wide 
chocolate <- reshape(data= chocolate_wide , direction="long", 
                     varying = list(Brand=3:5, Price=6:8, Type=9:11), 
                     v.names=c("Brand", "Price", "Type"), timevar="Alt")
                     
# use head() to confirm that the data has been properly transformed
head(chocolate)
##     Subject Trial Selection Response_Time Alt     Brand Price         Type
## 1.1    2401     1         1          5210   1      Dove   0.6         Milk
## 2.1    2401     2         2          7480   1    Godiva   2.7 Milk w/ Nuts
## 3.1    2401     3         2          7704   1 Hershey's   1.7 Dark w/ Nuts
## 4.1    2401     4         1          5774   1     Lindt   1.0         Milk
## 5.1    2401     5         2          5238   1 Hershey's   0.8 Milk w/ Nuts
## 6.1    2401     6         3          3423   1     Lindt   3.1         Milk
##     id
## 1.1  1
## 2.1  2
## 3.1  3
## 4.1  4
## 5.1  5
## 6.1  6
# Create the new order for the chocolate data frame
new_order <- order(chocolate$Subject, chocolate$Trial, chocolate$Alt)

# Reorder the chocolate data frame to the new_order
chocolate <- chocolate[new_order,]

# Look at the head() of chocolate to see how it has been reordered
head(chocolate)
##     Subject Trial Selection Response_Time Alt     Brand Price         Type
## 1.1    2401     1         1          5210   1      Dove   0.6         Milk
## 1.2    2401     1         1          5210   2    Godiva   0.7         Dark
## 1.3    2401     1         1          5210   3      Dove   3.6        White
## 2.1    2401     2         2          7480   1    Godiva   2.7 Milk w/ Nuts
## 2.2    2401     2         2          7480   2    Godiva   3.9         Dark
## 2.3    2401     2         2          7480   3 Hershey's   0.7 Milk w/ Nuts
##     id
## 1.1  1
## 1.2  1
## 1.3  1
## 2.1  2
## 2.2  2
## 2.3  2
# Use head(chocolate) and look at the Selection variable. 
head(chocolate)
##     Subject Trial Selection Response_Time Alt     Brand Price         Type
## 1.1    2401     1         1          5210   1      Dove   0.6         Milk
## 1.2    2401     1         1          5210   2    Godiva   0.7         Dark
## 1.3    2401     1         1          5210   3      Dove   3.6        White
## 2.1    2401     2         2          7480   1    Godiva   2.7 Milk w/ Nuts
## 2.2    2401     2         2          7480   2    Godiva   3.9         Dark
## 2.3    2401     2         2          7480   3 Hershey's   0.7 Milk w/ Nuts
##     id
## 1.1  1
## 1.2  1
## 1.3  1
## 2.1  2
## 2.2  2
## 2.3  2
# Transform the Selection variable to a logical indicator
chocolate$Selection <- chocolate$Alt == chocolate$Selection

# Use head(chocolate) to see how the Selection variable has changed. Now it is logical.
head(chocolate)
##     Subject Trial Selection Response_Time Alt     Brand Price         Type
## 1.1    2401     1      TRUE          5210   1      Dove   0.6         Milk
## 1.2    2401     1     FALSE          5210   2    Godiva   0.7         Dark
## 1.3    2401     1     FALSE          5210   3      Dove   3.6        White
## 2.1    2401     2     FALSE          7480   1    Godiva   2.7 Milk w/ Nuts
## 2.2    2401     2      TRUE          7480   2    Godiva   3.9         Dark
## 2.3    2401     2     FALSE          7480   3 Hershey's   0.7 Milk w/ Nuts
##     id
## 1.1  1
## 1.2  1
## 1.3  1
## 2.1  2
## 2.2  2
## 2.3  2
choc_choice <- chocolate %>%
    filter(Selection==TRUE) %>%
    mutate(Selection=Alt) %>%
    select(Subject, Trial, Response_Time, Selection)
choc_alts <- chocolate %>%
    select(Subject, Trial, Alt, Brand, Price, Type)

str(choc_choice)
## 'data.frame':    350 obs. of  4 variables:
##  $ Subject      : int  2401 2401 2401 2401 2401 2401 2401 2401 2401 2401 ...
##  $ Trial        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Response_Time: int  5210 7480 7704 5774 5238 3423 4691 3268 6719 3542 ...
##  $ Selection    : int  1 2 2 1 2 3 3 3 2 2 ...
str(choc_alts)
## 'data.frame':    1050 obs. of  6 variables:
##  $ Subject: int  2401 2401 2401 2401 2401 2401 2401 2401 2401 2401 ...
##  $ Trial  : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ Alt    : int  1 2 3 1 2 3 1 2 3 1 ...
##  $ Brand  : Factor w/ 5 levels "Dove","Ghirardelli",..: 1 3 1 3 3 4 4 3 4 5 ...
##  $ Price  : num  0.6 0.7 3.6 2.7 3.9 0.7 1.7 3.7 3 1 ...
##  $ Type   : Factor w/ 5 levels "Dark","Dark w/ Nuts",..: 3 1 5 4 1 4 2 1 1 3 ...
##  - attr(*, "reshapeLong")=List of 4
##   ..$ varying:List of 3
##   .. ..$ Brand: chr  "Brand1" "Brand2" "Brand3"
##   .. ..$ Price: chr  "Price1" "Price2" "Price3"
##   .. ..$ Type : chr  "Type1" "Type2" "Type3"
##   ..$ v.names: chr  "Brand" "Price" "Type"
##   ..$ idvar  : chr "id"
##   ..$ timevar: chr "Alt"
# Merge choc_choice and choc_alts
choc_merge <- merge(choc_choice, choc_alts, by=c("Subject", "Trial"))

# Convert Selection to a logical variable
choc_merge$Selection <- choc_merge$Selection == choc_merge$Alt

# Inspect chocolate_merge using head
head(choc_merge)
##   Subject Trial Response_Time Selection Alt     Brand Price         Type
## 1    2401     1          5210      TRUE   1      Dove   0.6         Milk
## 2    2401     1          5210     FALSE   2    Godiva   0.7         Dark
## 3    2401     1          5210     FALSE   3      Dove   3.6        White
## 4    2401    10          3542     FALSE   1     Lindt   0.6 Milk w/ Nuts
## 5    2401    10          3542      TRUE   2    Godiva   0.8 Milk w/ Nuts
## 6    2401    10          3542     FALSE   3 Hershey's   3.7         Dark
# Use xtabs to count up how often each Type is chosen
counts <- xtabs(~ Type + Selection, data=chocolate)

# Plot the counts
plot(counts, cex = 1.5)

# Modify this code to count up how many times each **Brand** is chosen
counts <- xtabs(~ Brand + Selection, data=chocolate)

# Plot the counts
plot(counts, cex = 1.5)

# Use xtabs to count up how often each Price is chosen
counts <- xtabs(~ Price + Selection, data=chocolate)

# Plot the counts
plot(counts, cex=0.6)


Chapter 3 - Building Choice Models

Choice models - under the hood:

  • The multinomial logit model begins with a linear equation for utility which drives probabilities
    • v1 <- alpha * 4 + beta * 100 # value
    • v2 <- alpha * 5 + beta * 150
    • v3 <- alpha * 2 + beta * 175
    • u1 <- v1 + error1 # utility
    • u2 <- v2 + error2
    • u3 <- v3 + error3
    • choice <- which.max(c(u1, u2, u3))
    • p1 <- exp(v1) / ( exp(v1) + exp(v2) + exp(v3) )
    • p2 <- exp(v2) / ( exp(v1) + exp(v2) + exp(v3) )
    • p3 <- exp(v3) / ( exp(v1) + exp(v2) + exp(v3) )
  • We want to estimate the parameters that best calculate the v
    • m1 <- mlogit(choice ~ 0 + seat + price, data=sportscar, print.level=3) # find parameters that maximize likelihood, print all iterations as per print.level=3
    • summary(m1)
  • The mlogit requires a specific data format
    • sportscar <- mlogit.data(sportscar.df, shape="long", choice="choice", varying=5:8, alt.var="alt") # what was chosen, which attributes vary, and which column holds the alternative number
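
Filling in toy values for alpha and beta makes the utility-to-probability step above runnable (the numbers are illustrative only):

```r
alpha <- 0.5; beta <- -0.02          # toy preference weights

v <- c(alpha * 4 + beta * 100,       # deterministic utilities
       alpha * 5 + beta * 150,
       alpha * 2 + beta * 175)

p <- exp(v) / sum(exp(v))            # multinomial logit (softmax) probabilities

round(p, 3)
sum(p)  # probabilities always sum to 1
```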

Interpreting choice model parameters:

  • Coefficients for each attribute are multiplied by the level of that attribute to create the v1/v2/v3
  • Factor variables are converted to numbers in the model.matrix process - need to understand which level is the baseline
    • head(model.matrix(m2))
    • head(sportscar)
  • May want to convert to factors even if the data are numeric
    • sportscar$seat <- as.factor(sportscar$seat)
    • m3 <- mlogit(choice ~ 0 + seat + trans + convert + price, data=sportscar)
    • summary(m3)
  • Can make the price response non-linear (factor) or keep it linear and compute willingness to pay
    • coef(m3)
    • coef(m3)/-coef(m3)[5] # assumes that price is the 5th element of the coef vector
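
With a linear price term, dividing the other coefficients by the negative of the price coefficient converts utilities into price units (willingness to pay). A toy sketch with made-up coefficients:

```r
# Hypothetical coefficients: the last element is a linear price effect
coefs <- c(seat4 = 0.4, seat5 = 0.1, transmanual = -0.9,
           convertyes = 0.2, price = -0.16)

# WTP in the same units as price: utility per feature / utility per price unit
wtp <- coefs / -coefs["price"]
round(wtp[1:4], 2)  # e.g. seat4: willing to pay 0.4 / 0.16 = 2.5 more
```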

Intercepts and interactions:

  • There is no intercept in the v1, v2, v3, since adding the same constant to each of them cancels out once the exponentials are taken
    • The intercept is not identified, so fixing it using ~ 0 + is preferred (otherwise the model arbitrarily picks a reference alternative)
  • An interaction term is based on the multiplication of several terms from the original dataset
    • m4 <- mlogit(choice ~ 0 + seat + trans + convert + price + trans:price, data=sportscar) # interaction between transmission and price
    • m4 <- mlogit(choice ~ 0 + seat + convert + trans*price, data=sportscar) # same model as above, with less typing
  • Interpreting the standard errors along with the z-values and p-values is similar to any other regression
  • Can also add segments, as long as they interact with at least one of the attributes
    • m5 <- mlogit(choice ~ 0 + seat + convert + trans + price:segment, data=sportscar)
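The identification point above is easy to verify numerically: shifting every alternative's value by the same constant leaves the logit probabilities unchanged, so a common intercept cannot be estimated.

```r
softmax <- function(v) exp(v) / sum(exp(v))

v  <- c(1.0, 0.5, -0.25)   # toy alternative values
p1 <- softmax(v)
p2 <- softmax(v + 100)     # add the same "intercept" to every alternative

all.equal(p1, p2)  # the constant cancels in the ratio of exponentials
```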

Predicting shares:

  • Share predictions can be a good way to communicate preferences
  • Begin by creating vectors of attributes of interest
    • price <- c(35, 30)
    • seat <- factor(c(2, 2), levels=c(2,4,5))
    • trans <- factor(c("manual", "auto"), levels=c("auto", "manual"))
    • convert <- factor(c("no", "no"), levels=c("no", "yes"))
    • segment <- factor(c("basic", "basic"), levels=c("basic", "fun", "racer"))
    • prod <- data.frame(seat, trans, convert, price, segment)
  • Use a model prediction to make the share predictions
    • m5 <- mlogit(choice ~ 0 + seat + convert + trans + price:segment, data=sportscar)
    • prod.coded <- model.matrix(update(m5$formula, 0 ~ .), data = prod)[,-1]
    • v <- prod.coded %*% m5$coef
    • p <- exp(v) / sum(exp(v))
    • cbind(p, prod)
    • predict_mnl <- function(model, products) {
    • data.model <- model.matrix(update(model$formula, 0 ~ .), data = products)[,-1]
    • utility <- data.model%*%model$coef
    • share <- exp(utility)/sum(exp(utility))
    • cbind(share, products)
    • }
    • shares <- predict_mnl(m5, prod)
    • barplot(shares$share, horiz = TRUE, col="tomato2", xlab = "Predicted Market Share", names.arg = c("Our Sportscar", "Competitor 1"))

Example code includes:

# use mlogit.data() to convert chocolate to mlogit.data
chocolate_df <- mlogit.data(chocolate, shape = "long",
                            choice = "Selection", alt.var = "Alt", 
                            varying = 6:8)
                         
# use str() to confirm that chocolate is an mlogit.data object
str(chocolate_df)


# Fit a model with mlogit() and assign it to choc_m1
choc_m1 <- mlogit(Selection ~ Brand + Type + Price, data=chocolate_df, print.level=3)

# Summarize choc_m1 with summary()
summary(choc_m1)


# modify the call to mlogit to exclude the intercept
choc_m2 <- mlogit(Selection ~ 0 + Brand + Type + Price, data = chocolate_df, print.level=3)

# summarize the choc_m2 model
summary(choc_m2)


# compute the wtp by dividing the coefficient vector by the negative of the price coefficient
coef(choc_m2) / -coef(choc_m2)["Price"]


# change the Price variable to a factor in the chocolate data
chocolate$Price <- as.factor(chocolate$Price)

# fit a model with mlogit and assign it to choc_m3
choc_m3 <- mlogit(Selection ~ 0 + Brand + Type + Price, data=chocolate)

# inspect the coefficients
summary(choc_m3)


# likelihood ratio test comparing two models
lrtest(choc_m2, choc_m3)


# add the formula for mlogit
choc_m4 <- mlogit(Selection ~ 0 + Brand + Type + Price + Brand:Type, data=chocolate)

# use summary to see the coefficients
summary(choc_m4)


# add the formula for mlogit
choc_m5 <- mlogit(Selection ~ 0 + Brand + Type + Price + Price:Trial, data=chocolate)

# use summary to see the outputs
summary(choc_m5)


# add the formula for mlogit
choc_m5 <- mlogit(Selection ~ 0 + Brand + Type + Price + Price:Trial, data=chocolate)

# use summary to see the outputs
summary(choc_m5)


predict_mnl <- function(model, products) {
  data.model <- model.matrix(update(model$formula, 0 ~ .), 
                             data = products)[,-1]
  utility <- data.model%*%model$coef
  share <- exp(utility)/sum(exp(utility))
  cbind(share, products)
}

# modify the code below so that the segment is set to "racer" for both alternatives
price <- c(35, 30)
seat <- factor(c(2, 2), levels=c(2,4,5))
trans <- factor(c("manual", "auto"), levels=c("auto", "manual"))
convert <- factor(c("no", "no"), levels=c("no", "yes"))
segment <- factor(c("racer", "racer"), levels=c("basic", "fun", "racer"))
prod <- data.frame(seat, trans, convert, price, segment)

# predict shares for the "racer" segment
predict_mnl(model=m5, products=prod)


# fit the choc_m2 model
choc_m2 <- mlogit(Selection ~ 0 + Brand + Type + Price, data=chocolate)

# create a data frame with the Ghirardelli products
Brand <- factor(rep("Ghirardelli", 5), levels = levels(chocolate$Brand))
Type <- levels(chocolate$Type)
Price <- 3   # treated as a number in choc_m2
ghir_choc <- data.frame(Brand, Type, Price)

# predict shares
predict_mnl(model=choc_m2, products=ghir_choc)


# compute and save the share prediction 
shares <- predict_mnl(choc_m2, ghir_choc)

# make a barplot of the shares
barplot(shares$share, 
        horiz = TRUE, col="tomato2",
        xlab = "Predicted Market Share", 
        main = "Shares for Ghirardelli chocolate bars at $3 each", 
        names.arg = levels(chocolate$Type)
        )

Chapter 4 - Hierarchical Choice Models

What is a hierarchical choice model?

  • Hierarchical choice models (random coefficient models) account for differences in preferences across entities (heterogeneity)
  • An assumption is made that each individual is pulled from a distribution
    • for (i in 1:n_resp) {
    • beta[i] <- mvrnorm(1, beta_0, Sigma) # random normal vector
    • for (j in 1:n_task[i]) {
    •   X <- X[X$resp == i & X$task == j, ]  
    •   u <- X %*% beta[i]  
    •   p[i,] <- exp(u) / sum(exp(u))  
    • }
    • }
  • Can fit the hierarchical model using the mlogit() function
    • sportscar <- mlogit.data(sportscar, choice="choice", shape="long", varying=5:8, alt.var="alt", id.var = "resp_id")
    • m7 <- mlogit(choice ~ 0 + seat + trans + convert + price, data = sportscar, rpar = c(price = "n"), panel = TRUE) # rpar=c(price="n") normally distributes the price parameter across people
    • summary(m7)
    • plot(m7) # plotting function for mlogit objects

Heterogeneity in preferences for other features:

  • Might have heterogeneity in choices on many other attributes, including factor data
  • A different manner of coding factors can work better for hierarchical models
    • Effects coding has -1, 0, 1 so that the factors are relative to average rather than relative to the first factor
    • contrasts(sportscar$seat) <- contr.sum(levels(sportscar$seat)) # stores effects coding with the data frame
    • dimnames(contrasts(sportscar$seat))[[2]] <- levels(sportscar$seat)[1:2] # improve readability
  • Can make all of the coefficients heterogeneous
    • my_rpar <- c("n", "n", "n", "n", "n") # make them all normal
    • m3 <- mlogit(choice ~ 0 + seat + trans + convert + price, data=sportscar) # get coefficient names
    • names(my_rpar) <- names(m3$coefficients) # assign them to my_rpar
    • m8 <- mlogit(choice ~ 0 + seat + trans + convert + price, data = sportscar, panel = TRUE, rpar = my_rpar) # fit the model
    • plot(m8, par=c("seat4", "seat5"))
    • -sum(m8$coef[1:2]) # can get the 2-seat coefficient since it no longer needs to be 0
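What effects (sum-to-zero) coding produces can be seen on a toy factor; this mirrors the `contr.sum()` calls above but is independent of the sportscar data.

```r
# three-level factor standing in for sportscar$seat
seat <- factor(c("2", "4", "5", "2", "4"))

# sum-to-zero coding: each column sums to zero, so coefficients are
# deviations from the average rather than from a reference level
contrasts(seat) <- contr.sum(levels(seat))
dimnames(contrasts(seat))[[2]] <- levels(seat)[1:2]  # readable column names

contrasts(seat)
#    2  4
# 2  1  0
# 4  0  1
# 5 -1 -1
```

The last level is coded -1 in every column, which is why the omitted level's coefficient can be recovered as the negative sum of the others, as in `-sum(m8$coef[1:2])`.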

Predicting shares with hierarchical models:

  • Can predict shares with a hierarchical model, including those where decision-making preferences may be correlated
    • m10 <- mlogit(choice ~ 0 + seat + trans + convert + price, data = sportscar, rpar = my_rpar, panel=TRUE, correlation = TRUE)
    • cor.mlogit(m10)
    • mean <- m10$coef[1:5] # hard coded
    • Sigma <- cov.mlogit(m10)
    • share <- matrix(NA, nrow=1000, ncol=nrow(prod.coded))
    • for (i in 1:1000) {
    • coef <- mvrnorm(1, mu=mean, Sigma=Sigma)
    • utility <- prod.coded %*% coef
    • share[i,] <- exp(utility) / sum(exp(utility))
    • }
    • cbind(colMeans(share), prod)
  • Niche product shares tend to increase when heterogeneity is included - preference for each niche attribute may be small, but correlated with preferences for the other attributes of the niche
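A self-contained version of the simulation loop, with a made-up mean vector, covariance matrix, and product design matrix standing in for the fitted `m10` quantities:

```r
library(MASS)  # mvrnorm() for multivariate normal draws
set.seed(1)

mu    <- c(seat4 = 0.4, convertyes = 0.2, price = -0.15)  # assumed means
Sigma <- diag(c(0.2, 0.3, 0.01))   # assumed covariance (uncorrelated here)
prod.coded <- rbind(c(1, 0, 35),   # hypothetical product 1
                    c(0, 1, 30))   # hypothetical product 2

share <- matrix(NA, nrow = 1000, ncol = nrow(prod.coded))
for (i in 1:1000) {
  coef <- mvrnorm(1, mu = mu, Sigma = Sigma)   # one simulated respondent
  utility <- prod.coded %*% coef
  share[i, ] <- exp(utility) / sum(exp(utility))
}
colMeans(share)  # population-level share prediction
```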

Wrap up:

  • Decisions need to be made in building a choice model - attributes, numeric vs. factors, hierarchical, distributions, nesting, etc.
  • Can start with basic models and build out as needed
    • Inspect the data - investigate a few choices to confirm understanding of the data
    • Run the model and inspect the standard errors - if too high, simplify
    • Heterogeneity is a good idea whenever the decisions are being made by humans

Example code includes:

# Determine the number of subjects in chocolate$Subject
length(levels(chocolate$Subject))


# add id.var input to mlogit.data call
chocolate <- mlogit.data(chocolate, choice = "Selection", shape="long", 
                         varying=6:8, alt.var = "Alt", id.var = "Subject"
                         )
                         
# add rpar and panel inputs to mlogit call
choc_m6 <- mlogit(Selection ~ 0 + Brand + Type + Price, data = chocolate, 
                  rpar = c(Price="n"), panel=TRUE)

# plot the model
plot(choc_m6)


# set the contrasts for Brand to effects code
contrasts(chocolate$Brand) <- contr.sum(levels(chocolate$Brand))
dimnames(contrasts(chocolate$Brand))[[2]] <- levels(chocolate$Brand)[1:4]
contrasts(chocolate$Brand)

# set the contrasts for Type to effects code
contrasts(chocolate$Type) <- contr.sum(levels(chocolate$Type))
dimnames(contrasts(chocolate$Type))[[2]] <- levels(chocolate$Type)[1:4]
contrasts(chocolate$Type)


# create my_rpar vector
choc_m2 <- mlogit(Selection ~ 0 + Brand + Type + Price, data=chocolate)
my_rpar <- rep("n", length(choc_m2$coef))
names(my_rpar) <- names(choc_m2$coef)
my_rpar

# fit model with random coefficients
choc_m7 <- mlogit(Selection ~ 0 + Brand + Type + Price, data=chocolate, rpar=my_rpar, panel=TRUE)


# print the coefficients 
choc_m7$coef[5:8]

# compute the negative sum of those coefficients
-sum(choc_m7$coef[5:8])


# Extract the mean parameters and assign to mean
mean <- choc_m8$coef[1:9]   

# Extract the covariance parameters and assign to Sigma
Sigma <- cov.mlogit(choc_m8) 

# Create storage for individual draws of share
share <- matrix(NA, nrow=1000, ncol=nrow(choc_line_coded))

# For each draw (person)
for (i in 1:1000) { 
  # Draw a coefficient vector
  coef <- mvrnorm(1, mu=mean, Sigma=Sigma)
  # Compute utilities for those coef
  utility <- choc_line_coded %*% coef
  # Compute probabilities according to the logit formula
  share[i,] <- exp(utility) / sum(exp(utility))
}  

# Average the draws of the shares
cbind(colMeans(share), choc_line)

Single-Cell RNA-Seq Workflows in R

Chapter 1 - What Is Single-Cell RNA-Seq?

Background and utility:

  • Can get gene expression data at the cellular level - allows for better resolution of gene expressions
    • Previous methods would get a mix of gene expression across many cells - better for averages and distributions, but cannot identify a specific observation or cell type
    • Implications are significant for personalized medicine and related areas
  • The data structure from the lab is a matrix - genes are rows and cells are columns
    • ATCG counts by intersection
    • Gene-level covariates
    • Cell-level covariates
  • There are many zeroes - gene not expressed in cell, or dropouts (technical errors)

Typical workflow:

  • There has been an exponential scaling in the ability to extract RNA data from 2009 - 2018
    • Full-range technologies try to capture the full RNA sequence
    • Tag-based technologies capture only one end (tag) of each RNA transcript
  • Quality control is the process of removing problematic cells - library size and cell coverage metrics
    • Library size is the total number of reads assigned to each cell
    • Coverage is the number (or fraction) of genes with reads assigned in each cell
  • Workflows then include normalization, dimensionality reduction, clustering, DE analysis (biomarkers with differential expression)

Load, create, and access data:

  • The SingleCellExperiment (SCE) class is an S4 class for storing data from single-cell experiments
    • source("https://bioconductor.org/biocLite.R")
    • biocLite("SingleCellExperiment")
    • library(SingleCellExperiment)
    • counts <- matrix(rpois(8, lambda = 10), ncol = 2, nrow = 4)
    • rownames(counts) <- c("Lamp5", "Fam19a1", "Cnr1", "Rorb")
    • colnames(counts) <- c("SRR2140028", "SRR2140022")
    • counts
  • Can create SCE data using the constructor
    • sce <- SingleCellExperiment(assays = list(counts = counts), rowData = data.frame(gene = rownames(counts)), colData = data.frame(cell = colnames(counts)))
  • Can create SCE data using coercion
    • se <- SummarizedExperiment(assays = list(counts = counts))
    • sce <- as(se, "SingleCellExperiment")
  • Can apply process to a real dataset
    • library(scRNAseq); data(allen) # subset of mouse visual cortex data

Example code includes:

# head of count matrix
counts[1:3, 1:3]

# count of specific gene and cell
alignedReads <- counts['Cnr1', "SRR2140055"]

# overall percentage of zero counts 
pZero <- mean(counts == 0)

# cell library size
libsize <- colSums(counts)


# find cell coverage
coverage <- colMeans(counts > 0)
cell_info$coverage <- coverage

# load ggplot2
library(ggplot2)

# plot cell coverage
ggplot(cell_info, aes(x = names, y = coverage)) + 
  geom_point() +
  ggtitle('Cell Coverage') + 
  xlab('Cell Name') + 
  ylab('Coverage')


# mean of GC content
gc_mean <- mean(gene_info$gc)

# standard deviation of GC content
gc_sd <- sd(gene_info$gc)

# boxplot of GC content 
boxplot(gene_info$gc, main = 'Boxplot - GC content', ylab = 'GC content')


# batch
batch <- cell_info$batch

# patient
patient <- cell_info$patient

# nesting of batch within patient
batch_patient <- table(batch = batch, patient = patient)

# explore batch_patient
batch_patient


# load SingleCellExperiment
library(SingleCellExperiment)

# create a SingleCellExperiment object
sce <- SingleCellExperiment(assays = list(counts = counts ),
                            rowData = data.frame(gene_names = rownames(counts)),
                            colData = data.frame(cell_names = colnames(counts)))


# create a SummarizedExperiment object from counts
se <- SummarizedExperiment(assays = list(counts = counts))

# create a SingleCellExpression object from se
sce <- as(se, "SingleCellExperiment")


# create SingleCellExperiment object
sce <- as(allen, "SingleCellExperiment")

# cell information
cell_info <- colData(sce)

# size factors
sizeFactors(sce) <- colSums(assay(sce))

Chapter 2 - Quality Control and Normalization

Quality Control:

  • Need to remove low quality cells and genes - identify first
  • Tung dataset includes three replicates
    • sce
    • library(scater)
    • sce <- calculateQCMetrics(sce, feature_controls = list(ERCC = isSpike(sce, "ERCC")))
  • ERCC spike-in genes are used for quality control - filter out cells with improper ratios of spike-in reads (usually too high, due to dead or stressed cells)
  • Key functions used in the exercises include
    • calculateQCMetrics()
    • counts()
    • rowSums()
    • grepl()
    • isSpike()
    • plot(density(x))
    • abline()

Quality Control (continued):

  • Can filter based on library sizes using
    • threshold <- 20000
    • plot(density(sce$total_counts), main = "Density - total_counts")
    • abline(v = threshold)
    • keep <- (sce$total_counts > threshold)
    • table(keep)
  • Can look at plots of the data
    • scater::plotPhenoData(sce, aes_string(x = "total_counts", y = "total_counts_ERCC", colour = "batch"))
  • Can then filter based on data that do not meet key criteria in the plot
    • keep <- (sce$batch != "NA19098.r2")
    • table(keep)
    • filter_genes <- apply(counts(sce), 1, function(x) length(x[x >= 2]) >= 2) # keep genes with counts of 2+ in at least 2 cells
    • table(filter_genes)

Normalization:

  • Want to group cells based on their gene expression profiles; target with different drugs, for example
  • Technical artifacts can be introduced in the data
    • Batch effect is a common problem with the data - cells may cluster by batch, even if they are from the same entity
    • Can normalize by dividing by library size (scaled as needed to get counts per million - CPM)
  • Exercise will use the following functions
    • plotPCA()
    • reducedDim(sce, “PCA”)[, 1:2]
    • computeSumFactors()
    • sizeFactors()
    • assays()
    • normalize()
    • plotRLE()
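The library-size (CPM) normalization mentioned above amounts to one line of matrix arithmetic; `counts` here is a tiny made-up gene-by-cell matrix, not the course dataset.

```r
counts <- matrix(c(10, 0, 5,
                    2, 8, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("c1", "c2", "c3")))

libsize <- colSums(counts)           # reads per cell
cpm <- t(t(counts) / libsize) * 1e6  # scale each cell to counts per million

colSums(cpm)  # every cell now sums to exactly one million
```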

Example code includes:

# remove genes with only zeros
nonZero <- counts(sce) > 0
keep <- rowSums(nonZero) > 0
sce_2 <- sce[keep, ]

# spike-ins ERCC
isSpike(sce_2, "ERCC") <- grepl("^ERCC-", rownames(sce_2))


# load scater
library(scater)

# calculate QCs
sce <- calculateQCMetrics(sce, feature_controls = list(ERCC = isSpike(sce, "ERCC")))

# explore coldata of sce
colData(sce)


# set threshold
threshold <- 20000

# plot density
plot(density(sce@colData$total_counts), main = 'Density - total_counts')
abline(v = threshold)

# keep cells
keep <- sce@colData$total_counts > threshold

# tabulate kept cells
table(keep)


# set threshold
threshold <- 6000

# plot density
plot(density(sce$total_features), main = 'Density - total_features')
abline(v=threshold)

# keep cells
keep <- sce$total_features > threshold

# tabulate kept cells
table(keep)


#extract cell data into a data frame
cDataFrame <- as.data.frame(colData(sce))

# plot cell data
ggplot(cDataFrame, aes(x = total_counts, y = total_counts_ERCC, col = batch)) + 
  geom_point()

# keep cells
keep <- sce$batch != "NA19098.r2"

# tabulate kept cells
table(keep)


# load SingleCellExperiment
library(SingleCellExperiment)

# filter genes
filter_genes <- apply(counts(sce), 1, function(x){
  length(x[x > 1]) > 1
})

# tabulate the results of filter_genes
table(filter_genes)


# PCA raw counts
plotPCA(sce, exprs_values = "counts",
    colour_by = "batch", shape_by = "individual")

# PCA log counts
plotPCA(sce, exprs_values = "logcounts_raw",
        colour_by = "batch", shape_by = "individual")


#find first 2 PCs
pca <- reducedDim(sce, "PCA")[, 1:2]

#create cdata
cdata <- data.frame(PC1 = pca[, 1],
                    libsize = sce$total_counts,
                    batch = sce$batch)

#plot pc1 versus libsize
ggplot(cdata, aes(x = PC1, y = libsize, col = batch)) +
  geom_point()


# load scran
library(scran)

# find size factors
sce <- computeSumFactors(sce)

# display size factor histogram
hist(sizeFactors(sce))


# view assays
assays(sce)

# normalize sce
normalized_sce <- normalize(sce)

# view new assay for normalized logcounts
assays(normalized_sce)

Chapter 3 - Visualization and Dimensionality Reduction

Mouse Epithelium Dataset:

  • Goal is to reduce the number of dimensions (from number of genes to something much smaller)
  • Mouse olfactory cell dataset - epithelium stem cell differentiation
    • Dimensionality reduction makes for smaller data while preserving signal much more than noise

Visualization:

  • Can visualize datasets using dimensionality reduction through any of several methods - PCA, tSNE, ZIFA, ZINB-WaVE
    • plotPCA(sce, exprs_values = "logcounts", shape_by = "Batch", colour_by = "publishedClusters") # logs help reduce bias towards highly expressed genes
    • plotTSNE(sce, exprs_values = "logcounts", shape_by = "Batch", colour_by = "publishedClusters", perplexity = 5) # perplexity is roughly a guess at the effective number of neighbors

Dimensionality Reduction:

  • Can find the most variable genes using magrittr
    • library(magrittr)
    • vars <- assay(sce) %>% log1p %>% rowVars
    • names(vars) <- rownames(sce)
    • vars <- sort(vars, decreasing = TRUE)
    • head(vars)
    • sce_sub <- sce[names(vars[1:50]),]
    • sce_sub
  • Can run dimensionality reduction using PCA
    • logcounts <- log1p(assay(sce_sub))
    • pca <- prcomp(t(logcounts))
    • reducedDims(sce_sub) <- SimpleList(PCA = pca$x)
    • sce_sub
    • head(reducedDim(sce_sub, "PCA")[, 1:2])
  • Can then plot the PCA components
    • pca <- reducedDim(sce_sub, "PCA")[, 1:2]
    • col <- colData(sce)[, c("publishedClusters", "batch")]
    • df <- cbind(pca, col)
    • ggplot(df, aes(x = PC1, y = PC2, col = publishedClusters, shape = batch)) +
    • geom_point()

Example code includes:

# find dimensions
mydims <- dim(sce)

# extract cell and gene names
cellNames <- colnames(sce)
geneNames <- rownames(sce)


# cell data
cData <- colData(sce)

#print column names
colnames(cData)

# table batch & clusters
cData <- cData[, c('Batch', 'publishedClusters')]

#tabulate cData
table(cData)


# load scater
library(scater)

# plot pc1 and pc2 counts
plotPCA(
    object = sce,
    exprs_values = "counts",
    shape_by = "Batch",
    colour_by = "publishedClusters"
)


# explore initial assays
assays(sce)

# create log counts
logcounts <- log1p(assays(sce)$counts)

# add log counts
assay(sce, 'logcounts') <- logcounts
assays(sce)

# pca log counts
plotPCA(object = sce, exprs_values = "logcounts",
    shape_by = "Batch", colour_by = "publishedClusters")


# default tSNE
plotTSNE(
    sce,
    exprs_values = "counts",
    shape_by = "publishedClusters",
    colour_by = "Batch",
    perplexity = 5
)


# gene variance 
vars <- assay(sce) %>% log1p() %>% rowVars() 

#rename vars
names(vars) <- rownames(sce)

#sort vars
vars_2 <- sort(vars, decreasing = TRUE)
head(vars_2)

# subset sce 
sce_sub <- sce[names(vars[1:50]), ]
sce_sub


# log counts
logcounts <- log1p(assays(sce_sub)$counts)

# transpose
tlogcounts <- t(logcounts)

# perform pca
pca <- prcomp(tlogcounts)

# store pca matrix in sce
reducedDims(sce_sub) <- SimpleList(PCA = pca$x)
head(reducedDim(sce_sub, "PCA")[, 1:2])


# Extract PC1 and PC2 and create a data frame
pca <- reducedDim(sce_sub, "PCA")[, 1:2]
col_shape <- data.frame(publishedClusters = colData(sce)$publishedClusters, Batch = factor(colData(sce)$Batch))
df <- cbind(pca, col_shape)

# plot PC1, PC2
ggplot(df, aes(x = PC1, y = PC2, 
            colour = publishedClusters, 
            shape = Batch)) + 
  geom_point()

Chapter 4 - Cell Clustering and Differential Expression Analysis

Clustering methods for scRNA-Seq:

  • Continuing to use the mouse epithelium dataset - cells color coded by cluster as per previous chapters
  • One of the goals of clustering is to group cells with similar gene expression, allowing for finding patterns in gene expression
    • Hierarchical clustering
    • k-means clustering
  • Challenges include setting the number of clusters, scalability to large datasets, etc.
  • Can begin by creating the Seurat object
    • library(Seurat)
    • library(SingleCellExperiment)
    • seuset <- CreateSeuratObject(
    • raw.data = assay(sce),
    • normalization.method = "LogNormalize",
    • scale.factor = 10000,
    • meta.data = as.data.frame(colData(sce))
    • )
    • seuset <- ScaleData(object = seuset)
    • seuset
  • Can then perform clustering on the seuset object
    • seuset <- FindClusters( object = seuset, reduction.type = "pca", dims.use = 1:10, resolution = 1.8, print.output = FALSE )
    • PCAPlot( object = seuset, group.by = "ident", pt.shape = "publishedClusters" )

Differential expression analysis:

  • Differential expression (DE) analysis looks for genes whose expression differs across cell types or conditions
    • Methods include SCDE, MAST, edgeR, DESeq2, etc.
  • Can fit a MAST model using function zlm
    • library(MAST)
    • zlm <- zlm(~ celltype + cngeneson, sca)
    • summary <- summary(zlm, doLRT = "celltype9")
    • summary
    • fit <- summary$datatable
    • fit <- merge(fit[contrast == 'celltype9' & component == 'H', .(primerid, `Pr(>Chisq)`)], fit[contrast == 'celltype9' & component == 'logFC', .(primerid, coef)], by = 'primerid')
    • fit[, padjusted := p.adjust(`Pr(>Chisq)`, 'fdr')]
    • res <- data.frame(gene = fit$primerid, pvalue = fit[, 'Pr(>Chisq)'], padjusted = fit$padjusted, logFC = fit$coef)
    • head(res)
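The `p.adjust(..., 'fdr')` step above in isolation, on toy p-values, shows what Benjamini-Hochberg adjustment does:

```r
pvals <- c(0.001, 0.004, 0.019, 0.030, 0.650)

# FDR (Benjamini-Hochberg) adjustment: inflates each p-value by n/rank,
# then enforces monotonicity from the largest p-value down
padj <- p.adjust(pvals, method = "fdr")
round(padj, 4)  # 0.0050 0.0100 0.0317 0.0375 0.6500
```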

Visualization of DE genes:

  • Visualization is typically the final step of single-cell analysis
  • The volcano plot looks at fold-change and p-values simultaneously
    • ggplot(res, aes(x=logFC, y=-log10(padjusted), color=mostDE)) + geom_point() +
    • ggtitle("Volcano") + xlab("log2 FC") + ylab("-log10 adjusted p-value")
  • Can also look at results of DE using a heatmap
    • library(NMF)
    • norm <- assay(sce[mostDE, ], "logcounts")
    • norm <- as.matrix(norm)
    • aheatmap(norm, annCol = colData(sce)$publishedClusters)
  • Course covered the typical workflow for analysis of single-cell RNA sequencing data
    • Normalization
    • Dimensionality reduction
    • Clustering
    • Differential expression analysis

Example code includes:

# load Seurat
library(Seurat)

#create seurat object
seuset <- CreateSeuratObject(
    raw.data = assay(sce),
    normalization.method = "LogNormalize", 
    scale.factor = 10000,
    meta.data = as.data.frame(colData(sce))
)

# scale seuset object
scaled_seuset <- ScaleData(object = seuset)


# perform pca
seuset <- RunPCA(
    object = seuset, 
    pc.genes = rownames(seuset@raw.data), 
    do.print = FALSE
)
# plot pca
PCAPlot(object = seuset,
        pt.shape = 'Batch',
        group.by = 'publishedClusters')


# load MAST
library(MAST)

# SingleCellAssay object 
sca

# fit zero-inflated regression 
zlm <- zlm(~ celltype + cngeneson, sca) 

# summary with likelihood test ratio
summary <- summary(zlm, doLRT = "celltype9")


# get summary table
fit <- summary$datatable

# pvalue df
pvalue <- fit[contrast == 'celltype9' & component == 'H', .(primerid, `Pr(>Chisq)`)]
  
# logFC df
logFC <- fit[contrast == 'celltype9' & component == 'logFC', .(primerid, coef)]

# pvalues and logFC
fit <- merge(pvalue, logFC, by = 'primerid')


# adjusted pvalues
fit[, padjusted:=p.adjust(`Pr(>Chisq)`, 'fdr')]

# result table
res <- data.frame(gene = fit$primerid,
                 pvalue = fit[,'Pr(>Chisq)'],
                 padjusted = fit$padjusted,
                 logFC = fit$coef)


# most DE 
res <- res[order(res$padjusted), ]
mostDE <- res$gene[1:20]
res$mostDE <- res$gene %in% mostDE

# volcano plot
ggplot(res, aes(x=logFC, y=-log10(padjusted), color=mostDE)) +
  geom_point() +
  ggtitle("Volcano plot") +
  xlab("log2 fold change") + 
  ylab("-log10 adjusted p-value")


# load NMF
library(NMF)

# normalize log counts
norm <- assay(sce[mostDE, ], "logcounts")
mat <- as.matrix(norm)

# heatmap
aheatmap(mat, annCol = colData(sce)$publishedClusters)

Differential Expression Analysis in R with limma

Chapter 1 - Differential Expression Analysis

Differential expression analysis:

  • Analysis of data from functional genomics experiments
  • Example of having treated cells with phenotypes A and B
    • The features can be genes or proteins or other molecular features of the cell - proxy for relative abundance (such as RNA)
    • Upregulated - higher expression level
    • Downregulated - lower expression level
  • Objectives are to look for novelty (genes that play an unexpected role), context (interpreting relevance of gene behaviors), systems level understanding (simultaneous allows for looking at pathways)
  • Many caveats to the analysis - study design is very important

Differential expression data:

  • Data will be from the breast cancer data and CLL data - testing for differences in two groups of people within each
    • x - Expression matrix
    • f - Feature data - genes or proteins
    • p - Phenotype data - description of each of the samples
  • Can begin by looking at the boxplot for a single gene
    • boxplot(x[1, ] ~ p[, "er"], main = f[1, "symbol"])

ExpressionSet class:

  • Data management can become precarious, especially when filtering and subsetting
  • Object-oriented programming can help - the class can hold the data, and has methods/functions that work specifically on objects of that class
    • Accessors (getters) get the data
    • Setters modify the stored data
    • source("https://bioconductor.org/biocLite.R")
    • biocLite("Biobase")
    • library(Biobase)
    • eset <- ExpressionSet(assayData = x, phenoData = AnnotatedDataFrame(p), featureData = AnnotatedDataFrame(f))
  • Can access data from an ExpressionSet object
    • x <- exprs(eset)
    • f <- fData(eset)
    • p <- pData(eset)
    • eset_sub <- eset[1000, 1:10]
    • boxplot(exprs(eset)[1, ] ~ pData(eset)[, "er"], main = fData(eset)[1, "symbol"])

The limma package:

  • Advantages of the limma package include replacing boilerplate code and improved inference by sharing information across genes (do not assume full independence - all genes are from the same experiment)
    • Method used is empirical Bayes - convenience and better statistics (especially for smaller data sets)
    • Good functions for pre-processing and post-processing
  • Specifying a linear model - Y = B0 + B1 * X1 + epsilon where Y is the expression level of the gene, B0 is the mean in ER- tumors, and B1 is the mean difference in expression in ER+ tumors
    • model.matrix(~ covariates, data = data_frame) # general pattern
    • design <- model.matrix(~er, data = pData(eset))
    • colSums(design)
  • Can then test the design matrix using the standard limma pipeline
    • library(limma)
    • fit <- lmFit(eset, design)
    • fit <- eBayes(fit)
    • results <- decideTests(fit[, “er”])
    • summary(results) # -1 will be downregulated and +1 will be upregulated
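What `model.matrix(~er, ...)` actually builds can be seen with a toy phenotype table standing in for `pData(eset)`:

```r
# four samples: two ER-negative, two ER-positive
p <- data.frame(er = factor(c("negative", "negative", "positive", "positive")))

design <- model.matrix(~er, data = p)
design
# the intercept column (all 1s) models the ER-negative mean (B0);
# the erpositive dummy models the ER+ minus ER- difference (B1)

colSums(design)  # samples contributing to each coefficient
```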

Example code includes:

# Create a boxplot of the first gene in the expression matrix
boxplot(x[1, ] ~ p[, "Disease"], main = f[1, "symbol"])


# Load package
library(Biobase)

# Create ExpressionSet object
eset <- ExpressionSet(assayData = x,
                      phenoData = AnnotatedDataFrame(p),
                      featureData = AnnotatedDataFrame(f))

# View the number of features (rows) and samples (columns)
dim(eset)


# Subset to only include the 1000th gene (row) and the first 10 samples
eset_sub <- eset[1000, 1:10]

# Check the dimensions of the subset 
dim(eset_sub)

# Create a boxplot of the first gene in eset_sub
boxplot(exprs(eset_sub)[1, ] ~ pData(eset_sub)[, "Disease"],
        main = fData(eset_sub)[1, "symbol"])


# Create design matrix for leukemia study
design <- model.matrix(~Disease, data = pData(eset))

# Count the number of samples modeled by each coefficient
colSums(design)


# Load package
library(limma)

# Fit the model
fit <- lmFit(eset, design)

# Calculate the t-statistics
fit <- eBayes(fit)

# Summarize results
results <- decideTests(fit[, "Diseasestable"])
summary(results)

Chapter 2 - Flexible Models for Common Study Designs

Flexible linear models:

  • The models can be extended, for example to Y = B0 + B1*X1 + B2*X2 + eps
    • This model is known as a treatment-contrast model
  • Can instead use a group-means parameterization - Y = B1*X1 + B2*X2 + eps
    • Can then test whether B1 - B2 == 0 since the intercept was excluded
    • design <- model.matrix(~0 + er, data = pData(eset))
    • cm <- limma::makeContrasts(status = erpositive - ernegative, levels = design) # erpositive-ernegative means multiply erpositive by 1 and multiply ernegative by -1
    • fit <- lmFit(eset, design) # per above
    • fit2 <- contrasts.fit(fit, contrasts = cm) # for the contrasts method
    • fit2 <- eBayes(fit2)
    • results <- decideTests(fit2)
    • summary(results)

Studies with more than two groups:

  • Dataset for this example has groups with leukemia types ALL, AML, CML - 20172x36 (12 leukemias of each type)
  • Desire to build a group-means model - Y = B1*X1 + B2*X2 + B3*X3 + eps
    • design <- model.matrix(~0 + type, data = pData(eset))
    • cm <- limma::makeContrasts(AMLvALL = typeAML - typeALL, CMLvALL = typeCML - typeALL, CMLvAML = typeCML - typeAML, levels = design)
    • fit <- lmFit(eset, design)
    • fit2 <- contrasts.fit(fit, contrasts = cm)
    • fit2 <- eBayes(fit2)
    • results <- decideTests(fit2)
  • Exercises will look at gene expressions of stem cells grown in states of hypoxia

Factorial experimental design:

  • Factorial designs look at every combination of experimental variables - for example, if there is a 2x2, then there would be 4 combinations examined
  • Example of 2x2 study in plants of types col, vte2 for temperatures of high, low - 11871 x 12
  • Can run the group-means model with the zero intercept; need to first create the type-temperature variable using paste
    • group <- with(pData(eset), paste(type, temp, sep = "."))
    • group <- factor(group) # records the unique levels
    • design <- model.matrix(~0 + group)
    • colnames(design) <- levels(group)
  • May want to assess the interaction effect (difference in impact of temperature by type) as well as the direct effects (impact of temperature on a specific type, impact for same temperature across types)
    • cm <- makeContrasts(type_normal = vte2.normal - col.normal, type_low = vte2.low - col.low, temp_vte2 = vte2.low - vte2.normal, temp_col = col.low - col.normal, interaction = (vte2.low - vte2.normal) - (col.low - col.normal), levels = design)
    • fit <- lmFit(eset, design)
    • fit2 <- contrasts.fit(fit, contrasts = cm)
    • fit2 <- eBayes(fit2)
    • results <- decideTests(fit2)
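A minimal base-R sketch (simulated cell means, not the course data) showing that the interaction contrast from the group-means model is the difference of the two temperature effects:

```r
# Simulated 2x2 factorial: 3 replicates per type/temperature cell
set.seed(2)
cells <- expand.grid(rep = 1:3, type = c("col", "vte2"), temp = c("normal", "low"))
cell_means <- c(col.normal = 10, vte2.normal = 11, col.low = 8, vte2.low = 12)
group <- factor(paste(cells$type, cells$temp, sep = "."))
y <- cell_means[as.character(group)] + rnorm(nrow(cells), sd = 0.1)

# Group-means model with zero intercept: one coefficient per cell
fit <- lm(y ~ 0 + group)
b <- coef(fit)
names(b) <- levels(group)

# Interaction: difference in the temperature effect between the two types
int_effect <- (b["vte2.low"] - b["vte2.normal"]) - (b["col.low"] - b["col.normal"])
int_effect  # close to (12 - 11) - (8 - 10) = 3
```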

Example code includes:

# Create design matrix with no intercept
design <- model.matrix(~0 + Disease, data = pData(eset))

# Count the number of samples modeled by each coefficient
colSums(design)


# Load package
library(limma)

# Create a contrasts matrix
cm <- makeContrasts(status = Diseaseprogres. - Diseasestable, levels = design)

# View the contrasts matrix
cm


# Load package
library(limma)

# Fit the model
fit <- lmFit(eset, design)

# Fit the contrasts
fit2 <- contrasts.fit(fit, contrasts = cm)

# Calculate the t-statistics for the contrasts
fit2 <- eBayes(fit2)

# Summarize results
results <- decideTests(fit2)
summary(results)


# Create design matrix with no intercept
design <- model.matrix(~0 + oxygen, data = pData(eset))

# Count the number of samples modeled by each coefficient
colSums(design)


# Load package
library(limma)

# Create a contrasts matrix
cm <- makeContrasts(ox05vox01 = oxygenox05 - oxygenox01,
                    ox21vox01 = oxygenox21 - oxygenox01,
                    ox21vox05 = oxygenox21 - oxygenox05,
                    levels = design)

# View the contrasts matrix
cm


# Load package
library(limma)

# Fit the model
fit <- lmFit(eset, design)

# Fit the contrasts
fit2 <- contrasts.fit(fit, contrasts = cm)

# Calculate the t-statistics for the contrasts
fit2 <- eBayes(fit2)

# Summarize results
results <- decideTests(fit2)
summary(results)


# Create single variable
group <- with(pData(eset), paste(type, water, sep = "."))
group <- factor(group)

# Create design matrix with no intercept
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Count the number of samples modeled by each coefficient
colSums(design)


# Load package
library(limma)

# Create a contrasts matrix
cm <- makeContrasts(type_normal = nm6.normal - dn34.normal,
                    type_drought = nm6.drought - dn34.drought,
                    water_nm6 = nm6.drought - nm6.normal,
                    water_dn34 = dn34.drought - dn34.normal,
                    interaction = (nm6.drought - nm6.normal) - (dn34.drought - dn34.normal),
                    levels = design)

# View the contrasts matrix
cm


# Load package
library(limma)

# Fit the model
fit <- lmFit(eset, design)

# Fit the contrasts
fit2 <- contrasts.fit(fit, contrasts = cm)

# Calculate the t-statistics for the contrasts
fit2 <- eBayes(fit2)

# Summarize results
results <- decideTests(fit2)
summary(results)

Chapter 3 - Pre-processing and post-processing

Normalizing and filtering:

  • Need to convert raw data to analysis-ready data - a generic first-pass approach for a new dataset includes:
    • Log transformation
    • Quantile normalization
    • Filtering
  • Can start with visualization of densities using the limma package
    • limma::plotDensities(eset, legend = FALSE)
    • exprs(eset) <- log(exprs(eset)) # log transform
    • limma::plotDensities(eset, legend = FALSE)
  • Want to remove technical artifacts using quantile normalization
    • exprs(eset) <- normalizeBetweenArrays(exprs(eset))
    • limma::plotDensities(eset, legend = FALSE)
    • abline(v = 5) # from visualization, 5 may be a cutoff where the data should be kept
    • keep <- rowMeans(exprs(eset)) > 5
    • eset <- eset[keep, ]
    • plotDensities(eset, legend = FALSE)
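What normalizeBetweenArrays does by default can be sketched in a few lines of base R - a minimal quantile normalization, assuming complete data (limma's implementation handles the general case):

```r
# Minimal quantile normalization sketch: force every column (sample)
# onto the same reference distribution
quantile_normalize <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")
  sorted <- apply(mat, 2, sort)
  ref    <- rowMeans(sorted)  # reference distribution: mean of each quantile
  apply(ranks, 2, function(r) ref[r])
}

m  <- cbind(s1 = c(2, 4, 6), s2 = c(3, 6, 9))
qn <- quantile_normalize(m)
qn  # both columns now share the values 2.5, 5, 7.5
```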

Accounting for technical batch effects:

  • Technical batch effects are artifacts arising from differences in when and how experiments are run - true for all experiment types, including functional genomics
    • Need to balance variables of interest across batches - cannot just take type a from batch a and type b from batch b and then test for differences
    • PCA and other dimension reduction techniques can help identify technical batch effects
    • limma::plotMDS(eset, labels = pData(eset)[, "time"], gene.selection = "common")
  • Removing batch effects is also possible in limma
    • exprs(eset) <- limma::removeBatchEffect(eset, batch = pData(eset)[, "batch"], covariates = pData(eset)[, "rin"])
    • limma::plotMDS(eset, labels = pData(eset)[, "time"], gene.selection = "common")
  • For statistical analysis, it is better to include batch as a coefficient in the model rather than to run removeBatchEffect first
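For a single batch factor with no covariates, the core idea behind removeBatchEffect can be sketched as re-centering each batch (hypothetical numbers; limma generalizes this by fitting a linear model):

```r
# One gene's expression across two batches; batch 2 is shifted upward
expr  <- c(5.1, 5.3, 4.9, 7.1, 7.3, 6.9)
batch <- factor(c(1, 1, 1, 2, 2, 2))

# Subtract each batch's mean and restore the overall mean
batch_means <- tapply(expr, batch, mean)
corrected   <- expr - batch_means[batch] + mean(expr)

tapply(corrected, batch, mean)  # batch means are now equal
```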

Visualizing results:

  • Can inspect the results and visualize using limma
    • results <- decideTests(fit2)
    • topTable(fit2, number = 3)
    • stats <- topTable(fit2, number = nrow(fit2), sort.by = "none")
  • Under the null hypothesis of no impact, the p-values should be uniformly distributed
    • hist(runif(10000))
    • hist(stats[, "P.Value"]) # should have many values near zero if there is an actual impact
  • Can examine results using a volcano plot
    • volcanoplot(fit2, highlight = 5, names = fit2$genes[, "symbol"])
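The uniform-vs-skewed p-value behavior can be simulated directly (toy simulation, not the course data):

```r
# p-values from two-sample t-tests: under the null they are uniform,
# under a real effect they pile up near zero
set.seed(3)
p_null <- replicate(2000, t.test(rnorm(10), rnorm(10))$p.value)
p_alt  <- replicate(2000, t.test(rnorm(10), rnorm(10, mean = 1.5))$p.value)

mean(p_null)        # near 0.5, as expected for a uniform distribution
mean(p_alt < 0.05)  # large fraction of small p-values
```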

Enrichment testing:

  • Can use curated biological databases as a reference point
  • Can use Fisher's exact test on a 2x2 table of counts
    • fisher.test(matrix(c(10, 100, 90, 900), nrow = 2))
  • Can also test for KEGG (reference set) enrichment, which requires a common ID
    • entrez <- fit2$genes[, "entrez"]
    • enrich_kegg <- kegga(fit2, geneid = entrez, species = "Hs") # Hs is Homo sapiens
    • topKEGG(enrich_kegg, number = 3)
  • Can also test for GO (reference set) enrichment
    • enrich_go <- goana(fit2, geneid = entrez, species = "Hs")
    • topGO(enrich_go, ontology = "BP", number = 3)
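The 2x2 table behind the enrichment test, spelled out with hypothetical counts (rows split genes by differential expression, columns by pathway membership):

```r
# Hypothetical counts: 30 of 200 DE genes fall in the pathway,
# versus 90 of 1800 non-DE genes
counts <- matrix(c(30, 170,
                   90, 1710),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("DE", "not DE"),
                                 c("in pathway", "not in pathway")))

res <- fisher.test(counts)
res$estimate  # odds ratio well above 1: the pathway is over-represented
res$p.value   # small enough to reject independence
```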

Example code includes:

# Load package
library(limma)

# View the distribution of the raw data
plotDensities(eset, legend = FALSE)

# Log transform
exprs(eset) <- log(exprs(eset))
plotDensities(eset, legend = FALSE)

# Quantile normalize
exprs(eset) <- normalizeBetweenArrays(exprs(eset))
plotDensities(eset, legend = FALSE)


# Load package
library(limma)

# View the normalized gene expression levels
plotDensities(eset, legend = FALSE); abline(v = 5)

# Determine the genes with mean expression level greater than 5
keep <- rowMeans(exprs(eset)) > 5
sum(keep)

# Filter the genes
eset <- eset[keep, ]
plotDensities(eset, legend = FALSE)


# Load package
library(limma)

# Plot principal components labeled by treatment
plotMDS(eset, labels = pData(eset)[, "treatment"], gene.selection = "common")

# Plot principal components labeled by batch
plotMDS(eset, labels = pData(eset)[, "batch"], gene.selection = "common")


# Load package
library(limma)

# Remove the batch effect
exprs(eset) <- removeBatchEffect(eset, batch = pData(eset)[, "batch"])

# Plot principal components labeled by treatment
plotMDS(eset, labels = pData(eset)[, "treatment"], gene.selection = "common")

# Plot principal components labeled by batch
plotMDS(eset, labels = pData(eset)[, "batch"], gene.selection = "common")


# Obtain the summary statistics for every gene
stats <- topTable(fit2, number = nrow(fit2), sort.by = "none")

# Plot a histogram of the p-values
hist(stats[, "P.Value"])


# Create a volcano plot. Highlight the top 5 genes
volcanoplot(fit2, highlight = 5, names = fit2$genes$symbol)


# Extract the entrez gene IDs
entrez <- fit2$genes[, "entrez"]

# Test for enriched KEGG Pathways
enrich_kegg <- kegga(fit2, geneid = entrez, species = "Hs")

# View the top 20 enriched KEGG pathways
topKEGG(enrich_kegg, number=20)


# Extract the entrez gene IDs
entrez <- fit2$genes[, "entrez"]

# Test for enriched GO categories
enrich_go <- goana(fit2, geneid = entrez, species = "Hs")

# View the top 20 enriched GO Biological Processes
topGO(enrich_go, ontology = "BP", number=20)

Chapter 4 - Case Study: Effect of Doxorubicin Treatment

Pre-process data:

  • Doxorubicin is a commonly prescribed cancer drug, with a strong side effect of cardiotoxicity
  • Hypothesis for the MOA is that top2B is involved, and was tested in mice
    • Wild-type mice were tested against Top2b-null mice, with each genotype given either DOX or a placebo
    • The eset data is 29532 x 12 (3 replicates per combination of 2x2 factors)
  • Begin by inspecting and then pre-processing the data
    • plotDensities(eset, group = pData(eset)[, "genotype"], legend = "topright")
  • Pre-processing steps include log transformation, quantile normalization, and filtering

Model the data:

  • Can look at the main clusters of mice vs. treatments (top2b null seem to cluster together)
    • plotMDS(eset, labels = pData(eset)[, "genotype"], gene.selection = "common")
    • plotMDS(eset, labels = pData(eset)[, "treatment"], gene.selection = "common")
  • Follow the model.matrix process to run a differential expression analysis
    • Contrasts will include wild to dox, top2b to dox, interaction effect of wild vs. top2b to dox
    • Can also run hypothesis tests using the limma pipeline and a Venn diagram

Inspect the results:

  • Initial results seem to support the hypothesis that top2b is the main vector for cardiotoxicity
  • The limma function topTable can be run, with a specified contrast
    • coef = "dox_wt"
    • coef = "dox_top2b"
    • coef = "interaction"
  • Can use the volcanoplot for x-axis of log-fold change and y-axis for the log-odds of differential expression
    • kegga and topKEGG functions
    • species = "Mm" # common house mouse

Wrap up:

  • Pre-processing and visualization of gene data
  • Principal components analysis and plotMDS()
  • Fitting a group-means model for more interpretable contrasts
  • Investigation of p-values for uniformity vs. skew for low p-values
  • Use of volcanoplots
  • Testing for enrichment of differentially expressed genes

Example code includes:

# Log transform
exprs(eset) <- log(exprs(eset))
plotDensities(eset,  group = pData(eset)[, "genotype"], legend = "topright")

# Quantile normalize
exprs(eset) <- normalizeBetweenArrays(exprs(eset))
plotDensities(eset,  group = pData(eset)[, "genotype"], legend = "topright")

# Determine the genes with mean expression level greater than 0
keep <- rowMeans(exprs(eset)) > 0
sum(keep)

# Filter the genes
eset <- eset[keep, ]
plotDensities(eset, group = pData(eset)[, "genotype"], legend = "topright")


# Find the row which contains Top2b expression data
top2b <- which(fData(eset)["symbol"] == "Top2b")

# Plot Top2b expression versus genotype
boxplot(exprs(eset)[top2b, ] ~ pData(eset)[, "genotype"], main = fData(eset)[top2b, ])


# Plot principal components labeled by genotype
plotMDS(eset, labels = pData(eset)[, "genotype"], gene.selection = "common")

# Plot principal components labeled by treatment
plotMDS(eset, labels = pData(eset)[, "treatment"], gene.selection = "common")


# Create single variable
group <- with(pData(eset), paste(genotype, treatment, sep = "."))
group <- factor(group)

# Create design matrix with no intercept
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Count the number of samples modeled by each coefficient
colSums(design)


# Create a contrasts matrix
cm <- makeContrasts(dox_wt = wt.dox - wt.pbs,
                    dox_top2b = top2b.dox - top2b.pbs,
                    interaction = (top2b.dox - top2b.pbs) - (wt.dox - wt.pbs),
                    levels = design)

# View the contrasts matrix
cm


# Fit the model
fit <- lmFit(eset, design)

# Fit the contrasts
fit2 <- contrasts.fit(fit, contrasts = cm)

# Calculate the t-statistics for the contrasts
fit2 <- eBayes(fit2)

# Summarize results
results <- decideTests(fit2)
summary(results)

# Create a Venn diagram
vennDiagram(results)


# Obtain the summary statistics for the contrast dox_wt
stats_dox_wt <- topTable(fit2, coef = "dox_wt", number = nrow(fit2), sort.by = "none")
# Obtain the summary statistics for the contrast dox_top2b
stats_dox_top2b <- topTable(fit2, coef = "dox_top2b", number = nrow(fit2), sort.by = "none")
# Obtain the summary statistics for the contrast interaction
stats_interaction <- topTable(fit2, coef = "interaction", number = nrow(fit2), sort.by = "none")

# Create histograms of the p-values for each contrast
hist(stats_dox_wt[, "P.Value"])
hist(stats_dox_top2b[, "P.Value"])
hist(stats_interaction[, "P.Value"])


# Extract the gene symbols
gene_symbols <- fit2$genes[, "symbol"]

# Create a volcano plot for the contrast dox_wt
volcanoplot(fit2, coef = "dox_wt", highlight = 5, names = gene_symbols)

# Create a volcano plot for the contrast dox_top2b
volcanoplot(fit2, coef = "dox_top2b", highlight = 5, names = gene_symbols)

# Create a volcano plot for the contrast interaction
volcanoplot(fit2, coef = "interaction", highlight = 5, names = gene_symbols)


# Extract the entrez gene IDs
entrez <- fit2$genes[, "entrez"]

# Test for enriched KEGG Pathways for contrast dox_wt
enrich_dox_wt <- kegga(fit2, coef = "dox_wt", geneid = entrez, species = "Mm")

# View the top 5 enriched KEGG pathways
topKEGG(enrich_dox_wt, number = 5)

# Test for enriched KEGG Pathways for contrast interaction
enrich_interaction <- kegga(fit2, coef = "interaction", geneid = entrez, species = "Mm")

# View the top 5 enriched KEGG pathways
topKEGG(enrich_interaction, number = 5)

Interactive Data Visualization with rbokeh

Chapter 1 - rbokeh Introduction

Getting started with rbokeh:

  • rbokeh is the R interface to the Python Bokeh plot package - interactive and informative for end users
  • Data manipulation and pre-processing are needed, with tidyverse being integral for this course
  • Can run an example using the gapminder dataset
    • library(gapminder)
    • data_2002 <- gapminder %>% filter(year == 2002)
    • gapminder_mod <- gapminder %>% mutate(pop_millions = pop/10^6)

Layers for rbokeh:

  • The rbokeh plot is initialized using figure() with layers added using the pipe operator (note the contrast with the plus used in ggplot2)
    • data_rwanda <- gapminder %>% filter(country == "Rwanda")
    • figure() %>% ly_lines(x = year, y = lifeExp, data = data_rwanda)
    • figure() %>% ly_lines(x = data_rwanda$year, y = data_rwanda$lifeExp) # same output, but with different axis labels
  • Can look at the ggplot2::economics dataset
    • plot_pop <- figure() %>% ly_lines(x = date, y = pop, data = economics)
    • plot_pop

Layers for rbokeh (continued):

  • There are many layers in rbokeh, and they all begin with ly_…()
  • Example for creating a one-layer plot
    • dat_1982 <- gapminder %>% filter(year == 1982)
    • figure() %>% ly_points(x = gdpPercap, y = lifeExp, data = dat_1982)
  • Example for creating a multi-layer plot - note that the data argument can be added to figure(), and will be inherited by the succeeding ly_*() commands
    • data_oceania <- gapminder %>% filter(continent == "Oceania")
    • figure(data = data_oceania, legend_location = "bottom_right") %>% ly_lines(x = year, y = gdpPercap, color = country) %>% ly_points(x = year, y = gdpPercap, color = country)

Example code includes:

## load rbokeh, gapminder and dplyr libraries
library(rbokeh)
library(gapminder)
library(dplyr)


## explore gapminder dataset 
str(gapminder)

## filter gapminder data by year 1982
dat_1982 <- gapminder %>% filter(year == 1982)


## plot life expectancy Vs GDP per Capita using data_1982
figure(legend_location = "bottom_right", title = "Life Expectancy Vs. GDP per Capita in 1982") %>% 
    ly_points(x = gdpPercap, y = lifeExp, data = dat_1982, 
              color = continent, hover = c(continent, country, pop)
              )


## filter the dataset for the continent Africa and year 1967
data_africa <- gapminder %>% 
  filter(year==1967, continent=="Africa")
  
## view data_africa
data_africa


## plot life expectancy Vs GDP per Capita using data_africa   
figure(legend_location = "bottom_right",
       title = "Life Expectancy Vs. GDP per Capita in Africa - 1967"
       ) %>% 
       ly_points(x = gdpPercap, y = lifeExp, data = data_africa, hover = c(country, pop))


## add a new column with gdp in millions
gapminder_mill <- gapminder %>% 
  mutate(gdp_millions = gdpPercap * pop / 10^6)
  
## view the first 6 entries in gapminder after adding  gdp_millions
head(gapminder_mill)

## extract the entries for "Rwanda"
data_rwanda <- gapminder_mill %>% 
  filter(country=="Rwanda")

## explore data_rwanda
data_rwanda


## plot gdp over time
figure(data = data_rwanda) %>% 
    ly_lines(x = year, y = gdp_millions, width = 2)


## explore the economics dataset
data(economics)
str(economics)

## pass vectors to x & y
figure() %>%
  ly_lines(x = economics$date, y = economics$pce)

## pass columns names and dataframe
figure() %>%
  ly_lines(x = date, y = pce, data = economics)


## plot unemployment rate  versus time and change the default `ylab`
figure(ylab = "unemployment %") %>%
  ly_lines(x=date, y=100*unemploy/pop, data=economics)


dat_1992 <- gapminder %>%
    filter(year==1992)
str(dat_1992)

## plot lifeExp Vs. gdpPercap using rbokeh
plot_1992<- figure(legend_location = "bottom_right") %>%
  ly_points(x=gdpPercap, y=lifeExp, color=continent, data=dat_1992) 

## show the plot            
plot_1992


data_countries <- gapminder %>%
    filter(country %in% c("United Kingdom", "Australia", "Canada", "United States", "New Zealand"))
str(data_countries)

figure(data = data_countries, legend="top_left") %>% 
  ly_lines(x = year, y = gdpPercap , color = country) %>% 
  ly_points(x=year, y=gdpPercap, color=country)


data_countries <- gapminder %>% 
    filter(country %in% c("China", "India"))

## create a line plot with lifeExp vs. year 
fig_countries <- figure(legend="top_left") %>% 
  ly_lines(x=year, y=lifeExp, color=country, data=data_countries)


## View fig_countries
fig_countries

## modify fig_countries by adding a points layer with gdpPercap vs. year 
fig_countries %>% 
  ly_points(x=year, y=lifeExp, color=country, data=data_countries)

Chapter 2 - rbokeh Aesthetic Attributes and Figure Options

Plot Aesthetic Attributes (Part I):

  • Can use aesthetic attributes to modify areas like color, transparency, line type, shape, and the like
    • figure(legend_location = "bottom_right", title = "Life Expectancy Vs. GDP per Capita in 1992") %>% ly_points(x = gdpPercap, y = lifeExp, data = dat_1992, color = continent)
  • Can use the Human Development Index (HDI) data from UNDP
    • hdi_countries <- hdi_data %>% filter(country %in% c("Hungary", "Bulgaria", "Poland"))
    • fig_col <- figure(data = hdi_countries, legend_location = "bottom_right") %>% ly_lines(x = year, y = human_development_index, color = country) %>% ly_points(x = year, y = human_development_index, color = country)
  • Color attributes can vary by layer function - ly_points() has both fill_color and line_color arguments, which will both inherit from color
    • Can make the fill_color explicit and set its alpha (default is 0.5) explicitly also
    • fig_col <- figure(data = hdi_countries, legend_location = "bottom_right") %>% ly_points(x = year, y = human_development_index, fill_color = country, fill_alpha = 1) %>% ly_lines(x = year, y = human_development_index, color = country)
  • Can add a custom color palette also
    • fig_col %>% set_palette(discrete_color = pal_color(c("#3182bd", "#31a354", "#de2d26")))

Plot Aesthetic Attributes (Part II):

  • Bechdel dataset - movie data on finances and the exclusion of women
    • A movie passes the Bechdel test if at least two women have a conversation about something other than a male character
    • figure() %>% ly_points(x = budget_2013, y = intgross_2013, data = dat_90_13) # has an over-plotting problem and needs a log-transform
    • figure() %>% ly_points(x = log(budget_2013), y = log(intgross_2013), data = dat_90_13) # log transform helps with a lot (but not all) of the over-plotting
    • figure() %>% ly_points(x = log(budget_2013), y = log(intgross_2013), data = dat_90_13, alpha = 0.4, size = 5) # improves readability
  • May want to change the line widths for many countries
    • hdi_countries <- hdi_data %>% filter(country %in% c("Rwanda", "Kenya", "Botswana"))
    • figure(title = "Human Development Index over Time", legend = "bottom_right") %>% ly_lines(x = year, y = human_development_index, data = hdi_countries, color = country)
    • Can use the width parameter to control the line width in ly_lines()
    • (WRONG FROM VIDEO) figure(title = "Human Development Index over Time", legend = "bottom_right") %>% ly_lines(x = year, y = human_development_index, data = hdi_countries, color = country, size = 3)
    • figure(title = "Human Development Index over Time", legend = "bottom_right") %>% ly_lines(x = year, y = human_development_index, data = hdi_countries, color = country, width = 3)

Hover Info and Figure Options:

  • Can combine the HDI and the CPI (corruption perception index)
    • The hover argument added to ly_points() allows for hover tooltips - see below
    • figure(legend_location = "bottom_right", title = "CPI versus HDI - 2015") %>% ly_points(x = corruption_perception_index, y = human_development_index, data = hdi_cpi_2015, color = continent, size = 7, hover = c(country, cpi_rank))
  • Can customize the hover text using hover =, where @ places a variable from the data frame at that point
    • figure(legend_location = "bottom_right", title = "CPI versus HDI - 2015") %>% ly_points(x = corruption_perception_index, y = human_development_index, data = hdi_cpi_2015, color = continent, size = 7, hover = "CPI Rank: @cpi_rank")
    • Can also use basic HTML tags such as <b> and <br> in the hover string
    • figure(legend_location = "bottom_right", title = "CPI versus HDI - 2015") %>% ly_points(x = corruption_perception_index, y = human_development_index, data = hdi_cpi_2015, color = continent, size = 7, hover = "<b>@country</b> <br> CPI Rank: @cpi_rank")
  • Can further add axis limits to the bokeh plots
    • hdi_cpi_scatter <- figure(legend_location = "bottom_right", title = "CPI versus HDI - 2015", ylim = c(0, 1), xlab = "CPI", ylab = "HDI", theme = bk_ggplot_theme()) %>% ly_points(x = corruption_perception_index_score, y = human_development_index, data = hdi_cpi_data, color = continent, size = 7)

Example code includes:

hdiRaw <- read.csv("./RInputFiles/Human Development Index (HDI).csv", skip=1)
str(hdiRaw)
hdi_data <- hdiRaw %>% 
    gather(key="year", value="human_development_index", -Country, -`HDI.Rank..2017.`) %>%
    mutate(country=str_trim(as.character(Country)), year=as.integer(str_sub(year, 2))) %>%
    filter(year %in% 1990:2015) %>%
    select(country, year, human_development_index)
str(hdi_data)

## extract "Namibia" and "Botswana" entries from hdi_data
hdi_countries <- hdi_data %>% 
    filter(country %in% c("Namibia", "Botswana"))
  
## plot human_development_index versus year
fig_col <- figure(data = hdi_countries, legend_location = "bottom_right") %>% 
    ly_lines(x = year, y = human_development_index, color = country) %>% 
    ly_points(x = year, y = human_development_index, 
              fill_color = "white", fill_alpha = 1,
              line_color = country, line_alpha = 1,
              size = 4
              )

## view plot 
fig_col


## use a custom palette with colors "green", "red"
fig_col %>% 
  set_palette(discrete_color = pal_color(c("green", "red")))

## define custom palette   
custom_pal <- pal_color(c("#c51b8a", "#31a354"))

## use custom_pal to modify fig_col
fig_col %>% 
    set_palette(discrete_color=custom_pal)


## explore bechdel dataset using str
data(bechdel, package="fivethirtyeight")
str(bechdel)

## extract entries between 1980 - 2013
dat_80_13 <- bechdel %>% 
  filter(between(year, 1980, 2013))

dat_80_13 <- dat_80_13 %>% 
  mutate(roi_total = intgross_2013 / budget_2013) 
  
## plot
figure() %>% 
  ly_points(x=log(budget_2013), y=log(roi_total), data=dat_80_13)

## plot log(roi_total) versus log(budget_2013)
figure() %>% 
  ly_points(x=log(budget_2013), y=log(roi_total), size=5, line_alpha=0, fill_alpha=0.3, data=dat_80_13)


## filter the data by country = Syrian Arab Republic
hdi_countries <- hdi_data %>% 
  filter(country %in% c("Syrian Arab Republic", "Morocco"))

## change the color and line width
figure(title = "Human Development Index over Time", legend = "bottom_right") %>% 
    ly_lines(x=year, y=human_development_index, color=country, width=3, data=hdi_countries)


# explore hdi_cpi_data dataset
# str(hdi_cpi_2015)

## add multiple values as hover info (country, cpi_rank)
# figure(legend_location = "bottom_right") %>% 
#     ly_points(x=corruption_perception_index, y=human_development_index, color=continent, hover=c(country, cpi_rank), size=6, data=hdi_cpi_2015)


## modify the figure theme 
# figure(title = "Corruption Perception Index Vs. Human Development Index 2015",
#        legend_location = "bottom_right", xgrid = FALSE, ygrid = FALSE, 
#        xlab = "CPI", ylab = "HDI", theme=bk_ggplot_theme()) %>% 
#     ly_points(x = corruption_perception_index, y = human_development_index, 
#               data = hdi_cpi_2015, color = continent, size = 6, hover = c(country, cpi_rank)
#               )

Chapter 3 - Data Manipulation for Visualization and More rbokeh Layers

Data Formats:

  • The proper data format for plotting can make rbokeh much easier
  • Frequently need to transform data from long format to wide format for easier plotting - tidyr::spread() will help (the inverse of gather())
    • hdi_cpi_wide <- hdi_cpi_long %>% spread(key = index, value = value)
  • May also want to transform data from wide format to long format, for example if time is a column
    • hdi_data_long <- hdi_data_wide %>% gather(key = year, value = human_development_index, -country) # year will become a new column, -country means leave country as its own column
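The same long-to-wide reshape can be sketched in base R (hypothetical values; the course itself uses tidyr::spread()):

```r
# Long format: one row per country/index combination (hypothetical values)
df_long <- data.frame(
  country = c("Norway", "Norway", "Chad", "Chad"),
  index   = c("hdi", "cpi", "hdi", "cpi"),
  value   = c(0.95, 85, 0.40, 20)
)

# Base-R equivalent of spread(key = index, value = value)
df_wide <- reshape(df_long, idvar = "country", timevar = "index",
                   direction = "wide")
df_wide  # columns: country, value.hdi, value.cpi
```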

More rbokeh Layers:

  • Can create a scatter plot with a regression line as an added layer
    • dat_90_13 <- bechdel %>% filter(between(year, 1990, 2013))
    • p_scatter <- figure() %>% ly_points(x = log(budget_2013), y = log(intgross_2013), data = dat_90_13, size = 5, alpha = 0.4)
    • lin_reg <- lm(log(intgross_2013) ~ log(budget_2013), data = dat_90_13)
    • summary(lin_reg)
    • p_scatter %>% ly_abline(lin_reg) # adds the fitted regression line to the figure

Interaction Tools:

  • Can use the interaction tool to pan, zoom, reset, and the like
    • figure(tools = c("pan", "wheel_zoom", "box_zoom", "reset", "save", "help"), toolbar_location = "right")
  • Tools can be any of "pan", "wheel_zoom", "box_zoom", "resize", "crosshair", "box_select", "lasso_select", "reset", "save", "help"
    • Location can be any of "above", "below", "left", "right", or NULL (remove the toolbar)
  • Example of customizing the available tools
    • figure(tools = c("pan", "wheel_zoom", "box_zoom"), toolbar_location = "above", legend_location = "bottom_right", ylim = c(0, 100)) %>% ly_points(x = gdpPercap, y = lifeExp, data = gapminder_2002, color = continent, size = 6, alpha = 0.7)
  • Can create a plot and then use the widget2png tool to convert to PNG
    • plot_scatter <- figure(title = "Life Expectancy Vs. GDP per Capita in 2002", legend_location = "bottom_right") %>% ly_points(x = gdpPercap, y = lifeExp, data = gapminder_2002)
    • widget2png(p = plot_scatter, file = "plot_scatter.png")
  • Can also save as html
    • rbokeh2html(fig = plot_scatter, file = "plot_scatter_interactive.html")
    • browseURL("plot_scatter_interactive.html")

Example code includes:

ctry <- c('Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium', 'Benin', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Costa Rica', 'Croatia', 'Cuba', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Eritrea', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania', 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Romania', 'Rwanda', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovenia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan', 'Sweden', 'Switzerland', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Zambia', 'Zimbabwe', 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium', 'Benin', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Costa Rica', 'Croatia', 'Cuba', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Eritrea', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana', 
'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania', 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Romania', 'Rwanda', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovenia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan', 'Sweden', 'Switzerland', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Zambia', 'Zimbabwe')
ctryCode <- c('AFG', 'ALB', 'DZA', 'AGO', 'ARG', 'AUS', 'AUT', 'BHR', 'BGD', 'BEL', 'BEN', 'BIH', 'BWA', 'BRA', 'BGR', 'BFA', 'BDI', 'KHM', 'CMR', 'CAN', 'CAF', 'TCD', 'CHL', 'CHN', 'COL', 'COM', 'CRI', 'HRV', 'CUB', 'CZE', 'DNK', 'DJI', 'DOM', 'ECU', 'EGY', 'SLV', 'ERI', 'ETH', 'FIN', 'FRA', 'GAB', 'GMB', 'DEU', 'GHA', 'GRC', 'GTM', 'GIN', 'GNB', 'HTI', 'HND', 'HUN', 'ISL', 'IND', 'IDN', 'IRQ', 'IRL', 'ISR', 'ITA', 'JAM', 'JPN', 'JOR', 'KEN', 'KWT', 'LBN', 'LSO', 'LBR', 'LBY', 'MDG', 'MWI', 'MYS', 'MLI', 'MRT', 'MUS', 'MEX', 'MNG', 'MON', 'MAR', 'MOZ', 'MMR', 'NAM', 'NPL', 'NLD', 'NZL', 'NIC', 'NER', 'NGA', 'NOR', 'OMN', 'PAK', 'PAN', 'PRY', 'PER', 'PHL', 'POL', 'PRT', 'ROM', 'RWA', 'STP', 'SAU', 'SEN', 'SCG', 'SLE', 'SGP', 'SVN', 'ZAF', 'ESP', 'LKA', 'SDN', 'SWE', 'CHE', 'THA', 'TGO', 'TTO', 'TUN', 'TUR', 'UGA', 'GBR', 'USA', 'URY', 'ZMB', 'ZWE', 'AFG', 'ALB', 'DZA', 'AGO', 'ARG', 'AUS', 'AUT', 'BHR', 'BGD', 'BEL', 'BEN', 'BIH', 'BWA', 'BRA', 'BGR', 'BFA', 'BDI', 'KHM', 'CMR', 'CAN', 'CAF', 'TCD', 'CHL', 'CHN', 'COL', 'COM', 'CRI', 'HRV', 'CUB', 'CZE', 'DNK', 'DJI', 'DOM', 'ECU', 'EGY', 'SLV', 'ERI', 'ETH', 'FIN', 'FRA', 'GAB', 'GMB', 'DEU', 'GHA', 'GRC', 'GTM', 'GIN', 'GNB', 'HTI', 'HND', 'HUN', 'ISL', 'IND', 'IDN', 'IRQ', 'IRL', 'ISR', 'ITA', 'JAM', 'JPN', 'JOR', 'KEN', 'KWT', 'LBN', 'LSO', 'LBR', 'LBY', 'MDG', 'MWI', 'MYS', 'MLI', 'MRT', 'MUS', 'MEX', 'MNG', 'MON', 'MAR', 'MOZ', 'MMR', 'NAM', 'NPL', 'NLD', 'NZL', 'NIC', 'NER', 'NGA', 'NOR', 'OMN', 'PAK', 'PAN', 'PRY', 'PER', 'PHL', 'POL', 'PRT', 'ROM', 'RWA', 'STP', 'SAU', 'SEN', 'SCG', 'SLE', 'SGP', 'SVN', 'ZAF', 'ESP', 'LKA', 'SDN', 'SWE', 'CHE', 'THA', 'TGO', 'TTO', 'TUN', 'TUR', 'UGA', 'GBR', 'USA', 'URY', 'ZMB', 'ZWE')
regn <- c('AP', 'ECA', 'MENA', 'SSA', 'AME', 'AP', 'WE/EU', 'MENA', 'AP', 'WE/EU', 'SSA', 'ECA', 'SSA', 'AME', 'WE/EU', 'SSA', 'SSA', 'AP', 'SSA', 'AME', 'SSA', 'SSA', 'AME', 'AP', 'AME', 'SSA', 'AME', 'WE/EU', 'AME', 'WE/EU', 'WE/EU', 'SSA', 'AME', 'AME', 'MENA', 'AME', 'SSA', 'SSA', 'WE/EU', 'WE/EU', 'SSA', 'SSA', 'WE/EU', 'SSA', 'WE/EU', 'AME', 'SSA', 'SSA', 'AME', 'AME', 'WE/EU', 'WE/EU', 'AP', 'AP', 'MENA', 'WE/EU', 'MENA', 'WE/EU', 'AME', 'AP', 'MENA', 'SSA', 'MENA', 'MENA', 'SSA', 'SSA', 'MENA', 'SSA', 'SSA', 'AP', 'SSA', 'SSA', 'SSA', 'AME', 'AP', 'ECA', 'MENA', 'SSA', 'AP', 'SSA', 'AP', 'WE/EU', 'AP', 'AME', 'SSA', 'SSA', 'WE/EU', 'MENA', 'AP', 'AME', 'AME', 'AME', 'AP', 'WE/EU', 'WE/EU', 'WE/EU', 'SSA', 'SSA', 'MENA', 'SSA', 'ECA', 'SSA', 'AP', 'WE/EU', 'SSA', 'WE/EU', 'AP', 'MENA', 'WE/EU', 'WE/EU', 'AP', 'SSA', 'AME', 'MENA', 'ECA', 'SSA', 'WE/EU', 'AME', 'AME', 'SSA', 'SSA', 'AP', 'ECA', 'MENA', 'SSA', 'AME', 'AP', 'WE/EU', 'MENA', 'AP', 'WE/EU', 'SSA', 'ECA', 'SSA', 'AME', 'WE/EU', 'SSA', 'SSA', 'AP', 'SSA', 'AME', 'SSA', 'SSA', 'AME', 'AP', 'AME', 'SSA', 'AME', 'WE/EU', 'AME', 'WE/EU', 'WE/EU', 'SSA', 'AME', 'AME', 'MENA', 'AME', 'SSA', 'SSA', 'WE/EU', 'WE/EU', 'SSA', 'SSA', 'WE/EU', 'SSA', 'WE/EU', 'AME', 'SSA', 'SSA', 'AME', 'AME', 'WE/EU', 'WE/EU', 'AP', 'AP', 'MENA', 'WE/EU', 'MENA', 'WE/EU', 'AME', 'AP', 'MENA', 'SSA', 'MENA', 'MENA', 'SSA', 'SSA', 'MENA', 'SSA', 'SSA', 'AP', 'SSA', 'SSA', 'SSA', 'AME', 'AP', 'ECA', 'MENA', 'SSA', 'AP', 'SSA', 'AP', 'WE/EU', 'AP', 'AME', 'SSA', 'SSA', 'WE/EU', 'MENA', 'AP', 'AME', 'AME', 'AME', 'AP', 'WE/EU', 'WE/EU', 'WE/EU', 'SSA', 'SSA', 'MENA', 'SSA', 'ECA', 'SSA', 'AP', 'WE/EU', 'SSA', 'WE/EU', 'AP', 'MENA', 'WE/EU', 'WE/EU', 'AP', 'SSA', 'AME', 'MENA', 'ECA', 'SSA', 'WE/EU', 'AME', 'AME', 'SSA', 'SSA')
cnt <- c('Asia', 'Europe', 'Africa', 'Africa', 'Americas', 'Oceania', 'Europe', 'Asia', 'Asia', 'Europe', 'Africa', 'Europe', 'Africa', 'Americas', 'Europe', 'Africa', 'Africa', 'Asia', 'Africa', 'Americas', 'Africa', 'Africa', 'Americas', 'Asia', 'Americas', 'Africa', 'Americas', 'Europe', 'Americas', 'Europe', 'Europe', 'Africa', 'Americas', 'Americas', 'Africa', 'Americas', 'Africa', 'Africa', 'Europe', 'Europe', 'Africa', 'Africa', 'Europe', 'Africa', 'Europe', 'Americas', 'Africa', 'Africa', 'Americas', 'Americas', 'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Europe', 'Asia', 'Europe', 'Americas', 'Asia', 'Asia', 'Africa', 'Asia', 'Asia', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Africa', 'Africa', 'Africa', 'Americas', 'Asia', 'Europe', 'Africa', 'Africa', 'Asia', 'Africa', 'Asia', 'Europe', 'Oceania', 'Americas', 'Africa', 'Africa', 'Europe', 'Asia', 'Asia', 'Americas', 'Americas', 'Americas', 'Asia', 'Europe', 'Europe', 'Europe', 'Africa', 'Africa', 'Asia', 'Africa', 'Europe', 'Africa', 'Asia', 'Europe', 'Africa', 'Europe', 'Asia', 'Africa', 'Europe', 'Europe', 'Asia', 'Africa', 'Americas', 'Africa', 'Europe', 'Africa', 'Europe', 'Americas', 'Americas', 'Africa', 'Africa', 'Asia', 'Europe', 'Africa', 'Africa', 'Americas', 'Oceania', 'Europe', 'Asia', 'Asia', 'Europe', 'Africa', 'Europe', 'Africa', 'Americas', 'Europe', 'Africa', 'Africa', 'Asia', 'Africa', 'Americas', 'Africa', 'Africa', 'Americas', 'Asia', 'Americas', 'Africa', 'Americas', 'Europe', 'Americas', 'Europe', 'Europe', 'Africa', 'Americas', 'Americas', 'Africa', 'Americas', 'Africa', 'Africa', 'Europe', 'Europe', 'Africa', 'Africa', 'Europe', 'Africa', 'Europe', 'Americas', 'Africa', 'Africa', 'Americas', 'Americas', 'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Europe', 'Asia', 'Europe', 'Americas', 'Asia', 'Asia', 'Africa', 'Asia', 'Asia', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Africa', 'Africa', 'Africa', 'Americas', 'Asia', 'Europe', 'Africa', 'Africa', 
'Asia', 'Africa', 'Asia', 'Europe', 'Oceania', 'Americas', 'Africa', 'Africa', 'Europe', 'Asia', 'Asia', 'Americas', 'Americas', 'Americas', 'Asia', 'Europe', 'Europe', 'Europe', 'Africa', 'Africa', 'Asia', 'Africa', 'Europe', 'Africa', 'Asia', 'Europe', 'Africa', 'Europe', 'Asia', 'Africa', 'Europe', 'Europe', 'Asia', 'Africa', 'Americas', 'Africa', 'Europe', 'Africa', 'Europe', 'Americas', 'Americas', 'Africa', 'Africa')
idx <- rep(c("corruption_perception_index", "human_development_index"), each=121)
cpiRk <- c(166, 88, 88, 163, 106, 13, 16, 50, 139, 15, 83, 76, 29, 76, 69, 76, 150, 150, 130, 10, 145, 147, 23, 83, 83, 136, 40, 50, 56, 38, 1, 98, 102, 106, 88, 72, 154, 102, 3, 23, 98, 123, 11, 56, 58, 123, 139, 158, 158, 111, 50, 13, 76, 88, 161, 18, 32, 61, 69, 18, 45, 139, 55, 123, 61, 83, 161, 123, 111, 54, 95, 111, 45, 111, 72, 61, 88, 111, 147, 45, 130, 9, 1, 130, 98, 136, 5, 60, 117, 72, 130, 88, 95, 29, 28, 58, 43, 66, 48, 61, 71, 119, 7, 34, 61, 37, 83, 165, 4, 6, 76, 106, 72, 76, 66, 139, 11, 16, 21, 76, 150, 166, 88, 88, 163, 106, 13, 16, 50, 139, 15, 83, 76, 29, 76, 69, 76, 150, 150, 130, 10, 145, 147, 23, 83, 83, 136, 40, 50, 56, 38, 1, 98, 102, 106, 88, 72, 154, 102, 3, 23, 98, 123, 11, 56, 58, 123, 139, 158, 158, 111, 50, 13, 76, 88, 161, 18, 32, 61, 69, 18, 45, 139, 55, 123, 61, 83, 161, 123, 111, 54, 95, 111, 45, 111, 72, 61, 88, 111, 147, 45, 130, 9, 1, 130, 98, 136, 5, 60, 117, 72, 130, 88, 95, 29, 28, 58, 43, 66, 48, 61, 71, 119, 7, 34, 61, 37, 83, 165, 4, 6, 76, 106, 72, 76, 66, 139, 11, 16, 21, 76, 150)
vl <- c(0.479, 0.764, 0.745, 0.533, 0.827, 0.939, 0.893, 0.824, 0.579, 0.896, 0.485, 0.75, 0.698, 0.754, 0.794, 0.402, 0.404, 0.563, 0.518, 0.92, 0.352, 0.396, 0.847, 0.738, 0.727, 0.498, 0.776, 0.827, 0.775, 0.878, 0.925, 0.473, 0.722, 0.739, 0.691, 0.68, 0.42, 0.448, 0.895, 0.897, 0.697, 0.452, 0.926, 0.579, 0.866, 0.64, 0.414, 0.424, 0.493, 0.625, 0.836, 0.921, 0.624, 0.689, 0.649, 0.923, 0.899, 0.887, 0.73, 0.903, 0.742, 0.555, 0.8, 0.763, 0.497, 0.427, 0.716, 0.512, 0.476, 0.789, 0.442, 0.513, 0.781, 0.762, 0.735, 0.807, 0.647, 0.418, 0.556, 0.64, 0.558, 0.924, 0.915, 0.645, 0.353, 0.527, 0.949, 0.796, 0.55, 0.788, 0.693, 0.74, 0.682, 0.855, 0.843, 0.802, 0.498, 0.574, 0.847, 0.494, 0.776, 0.42, 0.925, 0.89, 0.666, 0.884, 0.766, 0.49, 0.913, 0.939, 0.74, 0.487, 0.78, 0.725, 0.767, 0.493, 0.91, 0.92, 0.795, 0.579, 0.516, 11, 36, 36, 15, 32, 79, 76, 51, 25, 77, 37, 38, 63, 38, 41, 38, 21, 21, 27, 83, 24, 22, 70, 37, 37, 26, 55, 51, 47, 56, 91, 34, 33, 32, 36, 39, 18, 33, 90, 70, 34, 28, 81, 47, 46, 28, 25, 17, 17, 31, 51, 79, 38, 36, 16, 75, 61, 44, 41, 75, 53, 25, 49, 28, 44, 37, 16, 28, 31, 50, 35, 31, 53, 31, 39, 44, 36, 31, 22, 53, 27, 84, 91, 27, 34, 26, 88, 45, 30, 39, 27, 36, 35, 63, 64, 46, 54, 42, 52, 44, 40, 29, 85, 60, 44, 58, 37, 12, 89, 86, 38, 32, 39, 38, 42, 25, 81, 76, 74, 38, 21)

hdi_cpi_data_long <- data.frame(country=ctry, year=2015L, country_code=ctryCode, cpi_rank=cpiRk, 
                                region=regn, continent=cnt, index=idx, value=vl,
                                stringsAsFactors = FALSE
                                )

## explore hdi_cpi_data_long
str(hdi_cpi_data_long)

## What are the unique values in the index column?
unique(hdi_cpi_data_long$index)


## convert from long to wide
hdi_cpi_data_wide <- hdi_cpi_data_long %>% 
  spread(key=index, value=value)
  
## display the first 5 rows from hdi_cpi_data_wide
head(hdi_cpi_data_wide, 5)


## plot corruption_perception_index  versus human_development_index
figure(legend_location = "top_left") %>% 
    ly_points(x=human_development_index, y=corruption_perception_index, color=continent, alpha=0.7,
              hover=c(country, cpi_rank,corruption_perception_index, human_development_index), 
              data=hdi_cpi_data_wide
              )


## convert from wide to long
hdi_cpi_remake_long <- hdi_cpi_data_wide %>%
    gather(key="index", value="value", corruption_perception_index, human_development_index)
  
## display the first 5 rows of hdi_cpi_remake_long
head(hdi_cpi_remake_long, 5)
all.equal(hdi_cpi_data_long, hdi_cpi_remake_long)


## explore the unique values in the movie_budget column
# unique(dat_90_13_long$movie_budget)

## spread the values in the `movie_budget` in two columns
# dat_90_13_wide <- dat_90_13_long %>% 
#   spread(key=movie_budget, value=value)
  
## View column names of dat_90_13_wide
# names(dat_90_13_wide)

## create a scatter plot with log(budget_2013) Vs log(intgross_2013) 
# p_scatter <- figure() %>%
#   ly_points(y=log(intgross_2013), x=log(budget_2013), size=4, alpha=0.5, data=dat_90_13_wide)
  
## View plot
# p_scatter

## fit a linear reg model
# lin_reg <- lm(log(intgross_2013) ~ log(budget_2013), data = dat_90_13)

## add the linear regression line layer to p_scatter
# p_scatter %>% 
#   ly_abline(lin_reg)


## extract entries for year 2007
dat_2007 <- gapminder %>% 
  filter(year == 2007)
dat_2002 <- gapminder %>% 
  filter(year == 2002)

## create scatter plot
figure(toolbar_location="above", legend_location="bottom_right") %>%
    ly_points(x=gdpPercap, y=lifeExp, color=continent, size=6, alpha=0.7, 
              data=dat_2007, hover=c(country, lifeExp, gdpPercap)
              )

figure(legend_location = "bottom_right", tools=c("resize", "save")) %>% 
    ly_points(x = gdpPercap, y = lifeExp, data = dat_2002, color = continent)

figure(legend_location = "bottom_right", tools=c("resize", "save"), toolbar_location=NULL) %>% 
    ly_points(x = gdpPercap, y = lifeExp, data = dat_2002, color = continent)

Chapter 4 - Grid Plots and Maps

Intro to Grid Plots:

  • Example of dataset for TB by year by age group in the US - combine multiple figures in the same area
    • figure() %>% ly_bar(x = year, y = count, data = tb_2534, color = gender, position = "stack") # x should be a factor, default is stacked bars if position is missing (dodge or fill also available)
  • Can use the grid plot basics for multi-plotting
    • fig_list <- list(bar_2534 = bar_2534, bar_3544 = bar_3544)
    • grid_plot(fig_list, width = 1000, height = 500) # will have different axis limits by default
    • grid_plot(fig_list, width = 1000, height = 500, same_axes = TRUE) # forces the axes to be on the same scale
    • grid_plot(fig_list, width = 900, height = 600, nrow = 2, same_axes = TRUE) %>% theme_axis("x", major_label_orientation = 90) # overrides the default of placing all plots in one row
    • fig_list <- list(list(bar_1524 = bar_1524, bar_2534 = bar_2534), list(bar_3544 = bar_3544))
    • grid_plot(fig_list, same_axes = TRUE) %>% theme_axis("x", major_label_orientation = 90) # a list of lists plots one inner list per row, and NULL will place an empty plot in that column
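The nested-list layout above can be sketched end to end; this is a hypothetical sketch assuming three bar figures (bar_1524, bar_2534, bar_3544) have already been built with ly_bar() as elsewhere in these notes.

```r
## Sketch: a list of lists gives one grid row per inner list;
## NULL leaves an empty cell in that position.
library(rbokeh)

fig_rows <- list(list(bar_1524 = bar_1524, bar_2534 = bar_2534),  # row 1: two plots
                 list(bar_3544 = bar_3544, NULL))                 # row 2: one plot, one empty cell

grid_plot(fig_rows, same_axes = TRUE) %>%
    theme_axis("x", major_label_orientation = 90)
```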

Facets with Grid Plots:

  • Facets can be helpful for slicing data, placing small batches of the data (segmented by factor) in a larger plot
  • Can start by creating a plot for each of the groups, though this is inefficient
    • fig_list <- list(bar_1524 = bar_1524, bar_2534 = bar_2534, bar_3544 = bar_3544, bar_4554 = bar_4554, bar_5564 = bar_5564, bar_65 = bar_65)
    • grid_plot(fig_list, width = 900, height = 600, nrow = 2, same_axes = TRUE) %>% theme_axis("x", major_label_orientation = 90)
  • Can instead use the split() function and a function for plotting to create the relevant lists
    • tb_split_age <- split(tb, tb$age)
    • plot_bar <- function(x){ figure() %>% ly_bar(x = year, y = count, data = x, color = gender, position = "dodge")}
  • Can instead use the lapply() functionality
    • fig_list <- lapply(tb_split_age, plot_bar)
    • grid_plot(fig_list, width = 900, height = 600, nrow = 2, same_axes = TRUE) %>% theme_axis("x", major_label_orientation = 90)

rbokeh maps:

  • Can create interactive maps using rbokeh - transportation, density, population, and other areas where geography is key to understanding
  • NYC bike data available from the bikedata package
  • Can begin by initializing a map (appears to source Google Maps)
    • ny_map <- gmap(lat = 40.73306, lng = -73.97351, zoom = 11)
  • Can use the map type argument to change the type of map - hybrid, satellite, road, terrain
  • Can also customize maps using gmap_style(), such as making the water blue
    • ny_map <- gmap(lat = 40.73306, lng = -73.97351, zoom = 11, map_style = gmap_style("blue_water"))
  • Can then add points layers to the map and arrange in grids using grid_plot() as per previous chapters
    • ny_map %>% ly_points(x = station_longitude, y = station_latitude, data = ny_bikedata_20170427, fill_color = start_count, line_alpha = 0, size = 8, hover = c(station_name, start_count))
    • grid_plot(list(weekend_April23 = map_weekend_20170423, weekday_April25 = map_weekday_20170425), width = 860, height = 420)

Example code includes:

tb <- data.frame(iso2="US", 
                 gender=rep(c("m", "f"), each=84),
                 year=factor(c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008)), 
                 age=c(1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 1524, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 2534, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 3544, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 4554, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 5564, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65), 
                 count=c(355, 333, 330, 321, 331, 365, 320, 343, 365, 362, 383, 388, 414, 375, 876, 815, 701, 663, 616, 602, 613, 562, 526, 547, 535, 568, 490, 513, 1417, 1219, 1127, 1009, 1011, 906, 824, 813, 754, 728, 666, 659, 572, 495, 1121, 1073, 979, 1007, 930, 904, 876, 795, 828, 829, 767, 759, 744, 725, 742, 678, 679, 628, 601, 577, 524, 490, 487, 504, 499, 531, 533, 526, 1099, 1007, 944, 914, 801, 738, 649, 592, 650, 582, 624, 596, 562, 561, 280, 289, 269, 269, 232, 246, 239, 233, 277, 265, 241, 257, 257, 220, 579, 487, 449, 425, 391, 376, 410, 423, 353, 339, 348, 384, 338, 329, 499, 478, 447, 424, 394, 349, 346, 362, 310, 302, 276, 263, 260, 269, 285, 279, 254, 267, 245, 253, 247, 255, 269, 252, 242, 212, 225, 224, 202, 217, 201, 179, 244, 152, 176, 167, 169, 166, 161, 146, 135, 172, 591, 541, 514, 492, 444, 396, 389, 370, 354, 344, 322, 303, 308, 300), 
                 stringsAsFactors = FALSE
                 )
str(tb)
tb_2534 <- tb %>% filter(age==2534)
str(tb_2534)


## create a bar plot for age group tb_2534
bar_2534 <- figure() %>%
  ly_bar(x=year, y=count, color=gender, data=tb_2534, hover=TRUE)

## View figure
bar_2534


## create a bar plot for age group tb_2534 with % on the y-axis
bar_2534_percent <- figure(ylab = "share") %>% 
  ly_bar(x = year, y = count, data = tb_2534, color = gender, hover = TRUE, position = "fill")
         
## View figure
bar_2534_percent


## create a list with bar_2534 and bar_2534_percent figures
fig_list <- list(bar_2534 = bar_2534, bar_2534_percent = bar_2534_percent)

## create a grid plot 
grid_plot(fig_list, width=1000, height=400)

## create a grid plot with same axes limits
grid_plot(figs = fig_list, width = 1000, height = 400, same_axes=TRUE)


plot_line <- function(x){
    figure() %>% 
        ly_lines(x = year, y = count, data = x, color = age, alpha = 1, width = 2)
}

## create two dataframes for female/male data           
tb_female <- tb %>% filter(gender=="f")
tb_male <- tb %>% filter(gender=="m")


## create two plots using plot_line
fig_female <- plot_line(tb_female)
fig_male <- plot_line(tb_male)

## create figure list
fig_list <- list(female = fig_female, male = fig_male)

## plot the two figures in a grid
grid_plot(fig_list, width=1000, height=600, same_axes=TRUE)


## split tb data by gender 
tb_split_gender <- split(tb, tb$gender)

## create a list of figures using lapply
fig_list <- lapply(tb_split_gender, FUN=plot_line)

## create a grid plot 
grid_plot(fig_list, width=1000, height=600, same_axes=TRUE)


## define a function to create a bar plot with the number of tb cases over time
plot_bar <- function(x){ 
    figure() %>% 
        ly_bar(y=count, x=year, data=x, color = gender, position = "dodge", hover=TRUE)
}

## split tb data by age
tb_split_age <- split(tb, tb$age)

## apply the function to the groups in tb_split_age
fig_list <- lapply(tb_split_age, plot_bar)

## create a grid plot 
grid_plot(fig_list, width=600, height=900, nrow=3, same_axes=TRUE) %>% 
    theme_axis("x", major_label_orientation = 90)


## initialize a map for NY center
# ny_map <- gmap(lat=40.73306, lng=-73.97351, zoom=11, map_style=gmap_style("blue_water"))
# ny_map


## filter ny_bikedata to get the entries for day "2017-04-25"
# ny_bikedata_20170425 <- ny_bikedata %>% filter(trip_date==as.Date("2017-04-25"))

## add a points layer to ny_map
# ny_map %>%
#     ly_points(y=station_latitude, x=station_longitude, 
#               size=8, fill_color=start_count, line_alpha=0, 
#               data=ny_bikedata_20170425, hover=c(station_name, start_count, end_count)
#               )

## create a names list with the two figures
# fig_list <- list(map_weekend=map_weekend_20170423, map_weekday=map_weekday_20170425)

## create a grid plot with the 2 maps
# grid_plot(fig_list, width=860, height=420)

A/B Testing in R

Chapter 1 - Mini Case Study in A/B Testing

Introduction:

  • A/B testing is a powerful way to experiment with potential changes before implementing them
    • Framework for testing new ideas to improve an existing design (often a website)
  • Hypothetical example - cat adoption website - could a change in home page improve conversion rate (clicks divided by views)?
    • Question - does changing the photo improve conversion rate?
    • Hypothesis - cat in hat will improve conversion rate
    • Dependent variable - clicks
    • Independent variable - homepage photo
  • Need to begin by assessing conversion rates in the current website
    • click_data <- read_csv("click_data.csv")

Baseline conversion rates:

  • Continuing the previous example - hypothesis that cats in hats will drive “more” conversions
    • Need to define “more” relative to some baseline - recent past, control group at same time, etc.
    • click_data %>% summarize(conversion_rate = mean(clicked_adopt_today)) # mean from historical data
    • click_data_sum <- click_data %>% group_by(lubridate::month(visit_date)) %>% summarize(conversion_rate = mean(clicked_adopt_today)) # mean by month from the historical data
    • ggplot(click_data_sum, aes(x = month(visit_date), y=conversion_rate)) + geom_point() + geom_line()

Experimental design and power analysis:

  • Power analysis helps determine how many samples are needed (thus how long the experiment needs to run)
    • Ideal is to run both conditions simultaneously - mitigate seasonality and other potential confounders
    • Should know the planned test, baseline (control) value, and desired (test) value, as well as proportion of the data (typically 0.5), significance/alpha (typically 0.05), and power (typically 0.8)
  • Can use the powerMediation package to assess the power - note the function returns the total sample size, so divide by 2 to get the size of each group
    • library(powerMediation)
    • total_sample_size <- SSizeLogisticBin(p1 = 0.2, p2 = 0.3, B = 0.5, alpha = 0.05, power = 0.8) # divide by 2 to get each group's size
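The sample-size result can be turned into a rough run-time estimate; this is a minimal sketch, assuming a hypothetical figure of 40 site visitors per day.

```r
## Sketch: per-group size and approximate experiment duration.
library(powerMediation)

total_sample_size <- SSizeLogisticBin(p1 = 0.2, p2 = 0.3, B = 0.5,
                                      alpha = 0.05, power = 0.8)
per_group_size <- total_sample_size / 2          # function returns the combined size
days_to_run <- ceiling(total_sample_size / 40)   # 40 visitors/day is an assumed value
```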

Example code includes:

# Read in data
click_data <- readr::read_csv("./RInputFiles/click_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   clicked_adopt_today = col_double()
## )
click_data
## # A tibble: 3,650 x 2
##    visit_date clicked_adopt_today
##    <date>                   <dbl>
##  1 2017-01-01                   1
##  2 2017-01-02                   1
##  3 2017-01-03                   0
##  4 2017-01-04                   1
##  5 2017-01-05                   1
##  6 2017-01-06                   0
##  7 2017-01-07                   0
##  8 2017-01-08                   0
##  9 2017-01-09                   0
## 10 2017-01-10                   0
## # ... with 3,640 more rows
# Find oldest and most recent date
min(click_data$visit_date)
## [1] "2017-01-01"
max(click_data$visit_date)
## [1] "2017-12-31"
# Calculate the mean conversion rate by day of the week
click_data %>%
  group_by(weekdays(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 7 x 2
##   `weekdays(visit_date)` conversion_rate
##   <chr>                            <dbl>
## 1 Friday                           0.267
## 2 Monday                           0.277
## 3 Saturday                         0.256
## 4 Sunday                           0.3  
## 5 Thursday                         0.271
## 6 Tuesday                          0.271
## 7 Wednesday                        0.298
# Calculate the mean conversion rate by week of the year
click_data %>%
  group_by(lubridate::week(visit_date)) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 53 x 2
##    `lubridate::week(visit_date)` conversion_rate
##                            <dbl>           <dbl>
##  1                             1           0.229
##  2                             2           0.243
##  3                             3           0.171
##  4                             4           0.129
##  5                             5           0.157
##  6                             6           0.186
##  7                             7           0.257
##  8                             8           0.171
##  9                             9           0.186
## 10                            10           0.2  
## # ... with 43 more rows
# Compute conversion rate by week of the year
click_data_sum <- click_data %>%
    mutate(weekOfYear = lubridate::week(visit_date)) %>%
    group_by(weekOfYear) %>%
    summarize(conversion_rate = mean(clicked_adopt_today))

# Build plot
ggplot(click_data_sum, aes(x = `weekOfYear`, y = conversion_rate)) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits = c(0, 1), labels = scales::percent)

# Compute and look at sample size for experiment in August
total_sample_size <- powerMediation::SSizeLogisticBin(p1 = 0.54, p2 = 0.64, 
                                                      B = 0.5, alpha = 0.05, power = 0.8
                                                      )
total_sample_size
## [1] 758
# Compute and look at sample size for experiment in August with 5% increase
total_sample_size <- powerMediation::SSizeLogisticBin(p1 = 0.54, p2 = 0.59, 
                                                      B = 0.5, alpha = 0.05, power = 0.8
                                                      )
total_sample_size
## [1] 3085

Chapter 2 - Mini Case Study in A/B Testing - Part II

Analyzing Results:

  • Can analyze the experiment data from the previous design - available in a new dataset
    • experiment_data <- read_csv("experiment_data.csv")
    • experiment_data %>% group_by(condition) %>% summarize(conversion_rate = mean(clicked_adopt_today))
    • experiment_data_sum <- experiment_data %>% group_by(visit_date, condition) %>% summarize(conversion_rate = mean(clicked_adopt_today))
    • ggplot(experiment_data_sum, aes(x = visit_date, y = conversion_rate, color = condition, group = condition)) + geom_point() + geom_line()
  • Can further assess the statistical significance of the outcomes
    • glm(clicked_adopt_today ~ condition, family = "binomial", data = experiment_data) %>% broom::tidy()
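The tidied coefficients are on the log-odds scale; a minimal sketch (assuming the experiment_data frame from above, with a 0/1 clicked_adopt_today column and a condition column) converts them back to conversion rates with plogis(), the inverse-logit function.

```r
## Sketch: recover per-condition conversion rates from the fitted model.
fit <- glm(clicked_adopt_today ~ condition, family = "binomial",
           data = experiment_data)
coefs <- coef(fit)

plogis(coefs[1])              # baseline (control) conversion rate
plogis(coefs[1] + coefs[2])   # test-condition conversion rate
```

Because the model has one parameter per condition, these should match the group means from summarize().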

Designing follow-up experiments:

  • Can continue to refine and test new hypotheses, but typically still with one step at a time
    • Experiments need to be unique, and each with their own control group
    • Attempt to avoid confounding variables; hard to explain real-world outcomes if there were many changes at the same time

Pre-follow-up-experiment assumptions:

  • Control conditions for seasonal products can be especially challenging - be careful not to choose times that already have very extreme conversion rates

Follow-up experiment assumptions:

  • May want to look at the differences in conversion rate by month
    • eight_month_checkin_data_sum <- eight_month_checkin_data %>% mutate(month_text = month(visit_date, label = TRUE)) %>% group_by(month_text, condition) %>% summarize(conversion_rate = mean(clicked_adopt_today))
    • eight_month_checkin_data_diff <- eight_month_checkin_data_sum %>% spread(condition, conversion_rate) %>% mutate(condition_diff = cat_hat - no_hat)
    • mean(eight_month_checkin_data_diff$condition_diff)
    • sd(eight_month_checkin_data_diff$condition_diff)

Example code includes:

experiment_data <- read_csv("./RInputFiles/experiment_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   clicked_adopt_today = col_double()
## )
experiment_data
## # A tibble: 588 x 3
##    visit_date condition clicked_adopt_today
##    <date>     <chr>                   <dbl>
##  1 2018-01-01 control                     0
##  2 2018-01-01 control                     1
##  3 2018-01-01 control                     0
##  4 2018-01-01 control                     0
##  5 2018-01-01 test                        0
##  6 2018-01-01 test                        0
##  7 2018-01-01 test                        1
##  8 2018-01-01 test                        0
##  9 2018-01-01 test                        0
## 10 2018-01-01 test                        1
## # ... with 578 more rows
followup_experiment_data <- read_csv("./RInputFiles/eight_month_checkin_data.csv")
## Parsed with column specification:
## cols(
##   visit_date = col_date(format = ""),
##   condition = col_character(),
##   clicked_adopt_today = col_double()
## )
followup_experiment_data
## # A tibble: 4,860 x 3
##    visit_date condition clicked_adopt_today
##    <date>     <chr>                   <dbl>
##  1 2018-01-01 cat_hat                     1
##  2 2018-01-01 cat_hat                     1
##  3 2018-01-01 cat_hat                     0
##  4 2018-01-01 cat_hat                     0
##  5 2018-01-01 cat_hat                     0
##  6 2018-01-01 cat_hat                     0
##  7 2018-01-01 cat_hat                     0
##  8 2018-01-01 cat_hat                     0
##  9 2018-01-01 cat_hat                     1
## 10 2018-01-01 no_hat                      0
## # ... with 4,850 more rows
# Group and summarize data
experiment_data_clean_sum <- experiment_data %>%
    group_by(condition, visit_date) %>%
    summarize(conversion_rate = mean(clicked_adopt_today))

# Make plot of conversion rates over time
ggplot(experiment_data_clean_sum, aes(x = visit_date, y = conversion_rate, 
                                      color = condition, group = condition
                                      )
       ) + 
    geom_point() +
    geom_line()

# View summary of results
experiment_data %>% 
    group_by(condition) %>%
    summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition conversion_rate
##   <chr>               <dbl>
## 1 control             0.167
## 2 test                0.384
# Run logistic regression
experiment_results <- glm(clicked_adopt_today ~ condition, family = "binomial", 
                          data = experiment_data
                          ) %>%
    broom::tidy()
experiment_results
## # A tibble: 2 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      -1.61     0.156    -10.3  8.28e-25
## 2 conditiontest     1.14     0.197      5.77 7.73e- 9
# Run logistic regression power analysis
total_sample_size <- powerMediation::SSizeLogisticBin(p1 = 0.39, p2 = 0.59, B = 0.5, 
                                                      alpha = 0.05, power = 0.8
                                                      )
total_sample_size
## [1] 194
# View conversion rates by condition
followup_experiment_data %>%
    group_by(condition) %>%
    summarize(conversion_rate = mean(clicked_adopt_today))
## # A tibble: 2 x 2
##   condition conversion_rate
##   <chr>               <dbl>
## 1 cat_hat             0.459
## 2 no_hat              0.271
# Run logistic regression
followup_experiment_results <- glm(clicked_adopt_today ~ condition, family = "binomial",
                                   data = followup_experiment_data
                                   ) %>%
    broom::tidy()
followup_experiment_results
## # A tibble: 2 x 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       -0.163    0.0407     -4.01 6.02e- 5
## 2 conditionno_hat   -0.825    0.0611    -13.5  1.66e-41
# Compute monthly summary
eight_month_checkin_data_sum <- followup_experiment_data %>%
    mutate(month_text = lubridate::month(visit_date, label = TRUE)) %>%
    group_by(month_text, condition) %>%
    summarize(conversion_rate = mean(clicked_adopt_today))

# Plot month-over-month results
ggplot(eight_month_checkin_data_sum, aes(x = month_text, y = conversion_rate, 
                                         color = condition, group = condition
                                         )
       ) +
    geom_point() +
    geom_line()

# Plot monthly summary
ggplot(eight_month_checkin_data_sum, aes(x = month_text, y = conversion_rate,
                                         color = condition, group = condition
                                         )
       ) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
    labs(x = "Month", y = "Conversion Rate")

# Compute difference over time
# no_hat_data_diff <- no_hat_data_sum %>% 
#     spread(year, conversion_rate) %>% 
#     mutate(year_diff = `2018` - `2017`)
# no_hat_data_diff

# Compute summary statistics
# mean(no_hat_data_diff$year_diff, na.rm = TRUE)
# sd(no_hat_data_diff$year_diff, na.rm = TRUE)


# Run power analysis for logistic regression
total_sample_size <- powerMediation::SSizeLogisticBin(p1 = 0.49, p2 = 0.64, B = 0.5, 
                                                      alpha = 0.05, power = 0.8
                                                      )
total_sample_size
## [1] 341
# View summary of data
# followup_experiment_data_sep %>% 
#     group_by(condition) %>% 
#     summarize(conversion_rate=mean(clicked_adopt_today))

# Run logistic regression
# followup_experiment_sep_results <- glm(clicked_adopt_today ~ condition,
#                                        family = "binomial",
#                                        data = followup_experiment_data_sep
#                                        ) %>%
#     broom::tidy()
# followup_experiment_sep_results

Chapter 3 - Experimental Design in A/B Testing

A/B Testing Research Questions:

  • A/B testing combines experimental design and statistics - basic experimental principles are its core building blocks
  • Any experimental design that compares two ideas can be run as an A/B test - conversion rates, engagement with a web site, drop-off rates, total amount of time or money spent
  • Example of looking at time spent on a website
    • str(viz_website_2017)
    • viz_website_2017 %>% summarize(mean(time_spent_homepage_sec))
    • viz_website_2017 %>% group_by(month(visit_date)) %>% summarize(mean(time_spent_homepage_sec))

Assumptions and types of A/B testing:

  • Example of changing text in a website title, then checking the implications
  • Can look at within-group experiments (everyone sees both) or between-group experiments (everyone sees one or the other)
    • The within experiment will often have better power, while the between experiment is easier to run when people may only interact once
    • The between experiment needs to be appropriately random, so that whether a person sees A or B is not linked to other attributes of the person
  • There are several types of A/B testing
    • A/B testing is test and control
    • A/A testing is to verify that the control process is working well - should be no significant effects
    • A/B/N testing is a control condition with any number of test conditions - seems exciting and fast, but has more challenging statistics and requires more data points
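Appropriately random between-group assignment is just an independent draw per visitor. A minimal base-R sketch (the visitors table and visitor_id column are hypothetical, not from the course):

```r
set.seed(42)  # make the random assignment reproducible

# Hypothetical pool of 10 visitors to be randomized between conditions
visitors <- data.frame(visitor_id = 1:10)

# Assign each visitor to "A" or "B" with equal probability,
# independent of any attribute of the visitor
visitors$condition <- sample(c("A", "B"), size = nrow(visitors),
                             replace = TRUE, prob = c(0.5, 0.5))

table(visitors$condition)
```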

Confounding variables?

  • Confounding variables are elements of the environment that can confound your ability to find the real effect of A/B
    • Sometimes the confounder is internal to the experiment - examples of word length/novelty being the real driver rather than the specific word chosen
    • Sometimes the confounder is external to the experiment - examples of differing demographics by month, with the demographics having been a key driver of outcomes

Side effects:

  • Side effects are unintended effects of a change that you made
    • Example of changing from tools to tips if it changed the page loading times
  • Side effects can include load times and information “above the fold” (what a person sees without doing any scrolling)

Example code includes:

# Compute summary by month
viz_website_2017 %>%
    group_by(lubridate::month(visit_date)) %>%
    summarize(article_conversion_rate = mean(clicked_article))


# Compute 'like' click summary by month
viz_website_2017_like_sum <- viz_website_2017 %>%
    mutate(month = lubridate::month(visit_date, label = TRUE)) %>%
    group_by(month) %>%
    summarize(like_conversion_rate = mean(clicked_like))

# Plot 'like' click summary by month
ggplot(viz_website_2017_like_sum,
       aes(x = month, y = like_conversion_rate, group = 1)
       ) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits = c(0, 1), labels = scales::percent)


# Plot comparison of 'like' and 'share' actions on an article
ggplot(viz_website_2017_like_share_sum,
       aes(x = month, y = conversion_rate, color = action, group = action)
       ) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits = c(0, 1), labels = scales::percent)


# Compute conversion rates for A/A experiment
viz_website_2018_01_sum <- viz_website_2018_01 %>%
    group_by(condition) %>%
    summarize(like_conversion_rate = mean(clicked_like))
viz_website_2018_01_sum

# Plot conversion rates for two conditions
ggplot(viz_website_2018_01_sum, aes(x = condition, y = like_conversion_rate)) +
    geom_bar(stat = "identity") +
    scale_y_continuous(limits = c(0, 1), labels = scales::percent)


# Run logistic regression
aa_experiment_results <- glm(clicked_like ~ condition, family = "binomial", data = viz_website_2018_01) %>%
    broom::tidy()
aa_experiment_results


# Compute 'like' conversion rate by week and condition
viz_website_2018_02 %>%
    mutate(week = lubridate::week(visit_date)) %>%
    group_by(week, condition) %>%
    summarize(like_conversion_rate = mean(clicked_like))

# Compute 'like' conversion rate by if article published and condition
viz_website_2018_02 %>%
    group_by(article_published, condition) %>%
    summarize(like_conversion_rate = mean(clicked_like))


# Plot 'like' conversion rates by date for experiment
ggplot(viz_website_2018_02_sum,
       aes(x = visit_date, y = like_conversion_rate, color = condition,
           linetype = article_published, group = interaction(condition, article_published)
           )
       ) +
    geom_point() +
    geom_line() +
    geom_vline(xintercept = as.numeric(as.Date("2018-02-15"))) +
    scale_y_continuous(limits = c(0, 0.3), labels = scales::percent)


# Compute 'like' conversion rate and mean pageload time by day
viz_website_2018_03_sum <- viz_website_2018_03 %>%
    group_by(visit_date, condition) %>%
    summarize(mean_pageload_time = mean(pageload_time), like_conversion_rate = mean(clicked_like))

# Plot effect of 'like' conversion rate by pageload time
ggplot(viz_website_2018_03_sum, aes(x = mean_pageload_time, y = like_conversion_rate, color = condition)) +
    geom_point()


# Plot 'like' conversion rate by day
ggplot(viz_website_2018_03_sum, aes(x = visit_date, y = like_conversion_rate, color = condition,
                                    linetype = pageload_delay_added, 
                                    group = interaction(condition, pageload_delay_added)
                                    )
       ) +
    geom_point() +
    geom_line() +
    geom_vline(xintercept = as.numeric(as.Date("2018-03-15"))) +
    scale_y_continuous(limits = c(0, 0.3), labels = scales::percent)

Chapter 4 - Statistical Analyses in A/B Testing

Power analyses:

  • Generally, the goal of a power analysis is to determine the sample size - dependent on alpha, power (1 minus beta), and effect size
    • Effect size is often defined as the difference in the two groups divided by the standard deviation of the groups
  • The t-test is often used for significance, and can be planned using the library(pwr)
    • pwr.t.test(power = 0.8, sig.level = 0.05, d = 0.6) # will return the number of data points needed for power 0.8, alpha 0.05, effect size 0.6
    • pwr.t.test(power = 0.8, sig.level = 0.05, d = 0.2) # a ~3x smaller effect size drives a ~9x larger sample size (sample size scales with the inverse square of the effect size)
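The inverse-square relationship between effect size and required sample size can also be checked with base R's power.t.test() (an alternative to the pwr calls above, where delta/sd plays the role of Cohen's d):

```r
# Per-group sample size for a two-sample t-test at effect size d = 0.6
n_large_effect <- power.t.test(delta = 0.6, sd = 1,
                               sig.level = 0.05, power = 0.8)$n

# Per-group sample size at d = 0.2 (a 3x smaller effect)
n_small_effect <- power.t.test(delta = 0.2, sd = 1,
                               sig.level = 0.05, power = 0.8)$n

# Shrinking the effect size 3x inflates the required sample size roughly 9x
round(n_small_effect / n_large_effect, 1)
```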

Statistical tests:

  • Logistic regression and t-tests are both common statistical methods used for A/B testing
    • viz_website_2018_01 <- read_csv("viz_website_2018_01.csv")
    • aa_experiment_results <- t.test(time_spent_homepage_sec ~ condition, data = viz_website_2018_01)
  • Linear regression can be thought of as an extension of t-tests with more than 2 levels per variable
    • However, for an A/B test with only 2 levels, you will get the same results
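The t-test/regression equivalence can be verified directly on simulated data (all names and values below are made up; var.equal = TRUE matches the equal-variance assumption of ordinary least squares):

```r
set.seed(123)

# Simulate time-on-homepage for two conditions with a true difference of 3 sec
sim_data <- data.frame(
    condition = rep(c("A", "B"), each = 100),
    time_spent_homepage_sec = c(rnorm(100, mean = 45, sd = 10),
                                rnorm(100, mean = 48, sd = 10))
)

# Classic two-sample t-test (equal variances, to match OLS)
t_result <- t.test(time_spent_homepage_sec ~ condition,
                   data = sim_data, var.equal = TRUE)

# Linear regression on the same two-level factor
lm_result <- summary(lm(time_spent_homepage_sec ~ condition, data = sim_data))

# The p-values agree up to floating-point error
all.equal(t_result$p.value, lm_result$coefficients["conditionB", "Pr(>|t|)"])
```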

Stopping rules and sequential analysis:

  • Stopping rules are procedures that allow for interim analysis (peeks into the data) - also known as “sequential analysis”
    • Can stop because the experiment worked, stop because the experiment failed, or continue experiment
    • The p-value needs to be adjusted lower to account for the multiple peeks at the data
    • Need to be very careful to prevent p-hacking by creating the stopping rules and points in advance
  • The library(gsDesign) can help with running sequential analysis in R
    • library(gsDesign)
    • seq_analysis <- gsDesign(k = 4, test.type = 1, alpha = 0.05, beta = 0.2, sfu = "Pocock") # k=4 looks, test.type=1 is similar to one-sided test, alpha is significance, beta is 1-power so beta=0.2 is power=0.8, sfu is the spending function
  • Can then figure out the sample sizes using resource-based approaches
    • max_n <- 1000
    • max_n_per_group <- max_n / 2
    • stopping_points <- max_n_per_group * seq_analysis$timing

Multivariate testing:

  • Sometimes you want to make comparisons that account for multiple changes
    • multivar_results <- lm(time_spent_homepage_sec ~ word_one, data = viz_website_2018_05) %>% tidy() # single variable
    • multivar_results <- lm(time_spent_homepage_sec ~ word_one * word_two, data = viz_website_2018_05) %>% tidy() # full interaction effects
  • The default R order for regressions is to use the lowest alphanumeric as the baseline level - can modify this pre-regression though
    • multivar_results <- viz_website_2018_05 %>% mutate(word_one = factor(word_one, levels = c("tips", "tools"))) %>% mutate(word_two = factor(word_two, levels = c("better", "amazing"))) %>% lm(time_spent_homepage_sec ~ word_one * word_two, data = .) %>% tidy()

A/B Testing Recap:

  • Introduction to the basic concepts of A/B testing
  • New Ideas -> Experiments -> Statistical Analysis -> Implement Winners -> Repeat

Example code includes:

# Run power analysis for logistic regression
total_sample_size <- powerMediation::SSizeLogisticBin(p1 = 0.17, p2 = 0.27, 
                                                      B = 0.5, alpha = 0.05, power = 0.8
                                                      )
total_sample_size


# Run power analysis for t-test
sample_size <- pwr::pwr.t.test(d = 0.3, sig.level = 0.05, power = 0.8)
sample_size


# Run logistic regression
ab_experiment_results <- glm(clicked_like ~ condition, family = "binomial", data = viz_website_2018_04) %>%
    broom::tidy()
ab_experiment_results


# Run t-test
ab_experiment_results <- t.test(time_spent_homepage_sec ~ condition, data = viz_website_2018_04)
ab_experiment_results


# Run sequential analysis
seq_analysis_3looks <- gsDesign::gsDesign(k = 3, test.type = 1, 
                                          alpha = 0.05, beta = 0.2, sfu = "Pocock"
                                          )
seq_analysis_3looks


# Fill in the max number of points, compute points per group, and find stopping points
max_n <- 3000
max_n_per_group <- max_n / 2
stopping_points <- max_n_per_group * seq_analysis_3looks$timing
stopping_points


# Compute summary values for four conditions
viz_website_2018_05_sum <- viz_website_2018_05 %>% 
    group_by(word_one, word_two) %>% 
    summarize(mean_time_spent_homepage_sec = mean(time_spent_homepage_sec))

# Plot summary values for four conditions
ggplot(viz_website_2018_05_sum, aes(x = word_one, y = mean_time_spent_homepage_sec, fill = word_two)) + 
    geom_bar(stat = "identity", position = "dodge")


# Compute summary values for four conditions
viz_website_2018_05_sum <- viz_website_2018_05 %>% 
    group_by(word_one, word_two) %>% 
    summarize(like_conversion_rate = mean(clicked_like))

# Plot summary values for four conditions
ggplot(viz_website_2018_05_sum, aes(x = word_one, y = like_conversion_rate, fill = word_two)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_y_continuous(limits = c(0, 1), labels = percent)


# Organize variables and run logistic regression
viz_website_2018_05_like_results <- viz_website_2018_05 %>%
    mutate(word_one = factor(word_one, levels = c("tips", "tools"))) %>%
    mutate(word_two = factor(word_two, levels = c("better", "amazing"))) %>%
    glm(clicked_like ~ word_one * word_two, family = "binomial", data = .) %>%
    broom::tidy()
viz_website_2018_05_like_results

Mixture Models in R

Chapter 1 - Introduction to Mixture Models

Introduction to Model-Based Clustering:

  • Mixture models are a tool for model-based clustering (partitioning and segmentation)
    • The objective is for clusters to be homogeneous within and heterogeneous across
  • Common techniques include k-means (assign to nearest centers) and hierarchical (connect based on similarity) and probabilistic model-based approaches (mixture models)
    • gender <- read.csv("gender.csv")
    • ggplot(gender, aes(x = Weight, y = BMI)) + geom_point()
  • May be helpful to have probabilities by gender rather than a hard cutoff for male vs. female
    • This is the core of the mixture model - assumes an underlying probability distribution, with the outcome being the combination of these distributions

Gaussian Distribution:

  • There are packages for fitting mixture models in R - mixtools (no Poisson), bayesmix (Bayesian), EMCluster (Gaussian only), flexmix (covered in this course)
  • The Gaussian distribution frequently plays a role in the mixture model
    • Defined by the mean and standard deviation
    • rnorm(n, mean, sd) # sample from the Gaussian of mean and sd, taking n samples
    • The mean is typically estimated as the sample mean, and the sd is typically estimated as root-mean-squared-delta-from-mean - sd()
    • ggplot(data = population_sample) + geom_histogram(aes(x = x, y = ..density..)) + stat_function(geom = "line", fun = dnorm, args = list(mean = mean_estimate, sd = standard_deviation_estimate))

Gaussian Mixture Models (GMM):

  • Can imagine two Gaussian distributions, and pick from each randomly with 50/50 probability
    • number_of_obs <- 500
    • coin <- sample(c(0,1), size = number_of_obs, replace = TRUE, prob = c(0.5, 0.5)) # can change the coin to be non-50/50 for other mixture simulations
    • gauss_1 <- rnorm(n = number_of_obs, mean = 5, sd = 2)
    • gauss_2 <- rnorm(n = number_of_obs)
    • mixture_simulation <- ifelse(coin, gauss_1, gauss_2)
    • head(cbind(coin, gauss_1, gauss_2, mixture_simulation))
    • mixture_simulation <- data.frame(x = mixture_simulation)
    • ggplot(mixture_simulation) + geom_histogram(aes(x = x, ..density..), bins = 40)
  • Can also create mixtures of 3+ underlying Gaussian
    • proportions <- sample(c(0, 1, 2), number_of_obs, replace = TRUE, prob = c(1/3, 1/3, 1/3))
    • gauss_3 <- rnorm(n = number_of_obs, mean = 10, sd = 1)
    • mixture_simulation <- data.frame(x = ifelse(proportions == 0, gauss_1, ifelse(proportions == 1, gauss_2, gauss_3)))
    • ggplot(mixture_simulation) + geom_histogram(aes(x = x, ..density..), bins = 40)

Example code includes:

gender <- readr::read_csv("./RInputFiles/gender.csv")
## Parsed with column specification:
## cols(
##   Gender = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   BMI = col_double(),
##   probability = col_double()
## )
glimpse(gender)
## Observations: 10,000
## Variables: 5
## $ Gender      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "M...
## $ Height      <dbl> 73.84702, 68.78190, 74.11011, 71.73098, 69.88180, ...
## $ Weight      <dbl> 241.8936, 162.3105, 212.7409, 220.0425, 206.3498, ...
## $ BMI         <dbl> 0.04435662, 0.03430822, 0.03873433, 0.04276545, 0....
## $ probability <dbl> 5.778312e-06, 6.059525e-01, 2.625952e-05, 3.628734...
# Have a look to gender (before clustering)
head(gender)
## # A tibble: 6 x 5
##   Gender Height Weight    BMI probability
##   <chr>   <dbl>  <dbl>  <dbl>       <dbl>
## 1 Male     73.8   242. 0.0444  0.00000578
## 2 Male     68.8   162. 0.0343  0.606     
## 3 Male     74.1   213. 0.0387  0.0000263 
## 4 Male     71.7   220. 0.0428  0.000363  
## 5 Male     69.9   206. 0.0423  0.00461   
## 6 Male     67.3   152. 0.0337  0.911
# Scatterplot with probabilities
gender %>% 
  ggplot(aes(x = Weight, y = BMI, col = probability))+
  geom_point(alpha = 0.5)

# Set seed
set.seed(1313)

# Simulate a Gaussian distribution
simulation <- rnorm(n = 500, mean = 5, sd = 4)

# Check first six values
head(simulation)
## [1]  2.618374  8.719739 10.469360 11.462134  6.165605  7.497809
# Estimation of the mean
mean_estimate <- mean(simulation)
mean_estimate
## [1] 5.324427
# Estimation of the standard deviation
standard_deviation_estimate <- sd(simulation)
standard_deviation_estimate
## [1] 3.769612
# Transform the results to a data frame
simulation <- data.frame(x = simulation)

# Plot the sample with the estimated curve
ggplot(simulation) + 
  geom_histogram(aes(x = x, y = ..density..)) + 
  stat_function(geom = "line", fun = dnorm,
                args = list(mean = mean_estimate, 
                sd = standard_deviation_estimate))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Estimation of the mean
mean_estimate <- gender %>% 
  pull(Weight) %>% 
  mean()
mean_estimate
## [1] 161.4404
# Estimation of the standard deviation
sd_estimate <- gender %>% 
  pull(Weight) %>% 
  sd()
sd_estimate
## [1] 32.10844
# Plot the sample with the estimated curve
gender %>% 
  ggplot() + 
  geom_histogram(aes(x = Weight, y = ..density..), bins = 100) + 
  stat_function(geom = "line", fun = dnorm,
                args = list(mean = mean_estimate, sd = sd_estimate))

# Create coin object
coin <- sample(c(0, 1), size = 500, replace = TRUE, prob = c(0.2, 0.8))

# Sample from two different Gaussian distributions
mixture <- ifelse(coin == 1, rnorm(n = 500, mean = 5, sd = 2), rnorm(n = 500))

# Check the first elements
head(mixture)
## [1] 6.715330 5.157209 2.158443 5.133819 6.089409 1.007223
# Transform into a data frame
mixture <- data.frame(x = mixture)

# Create a histogram, specifying that it is a density plot
mixture %>% ggplot() + 
    geom_histogram(aes(x = x, y = ..density..), bins = 50)

number_observations <- 1000

# Create the assignment object
assignments <- sample(c(0, 1, 2), size = number_observations, replace = TRUE, prob = c(0.3, 0.4, 0.3))

# Simulate the GMM with 3 distributions
mixture <- data.frame(
    x = ifelse(assignments == 1, rnorm(n = number_observations, mean = 5, sd = 2), 
               ifelse(assignments == 2, 
                      rnorm(n = number_observations, mean = 10, sd = 1), 
                      rnorm(n = number_observations)
                      )
               )
    )

# Plot the mixture
mixture %>% 
    ggplot() + 
    geom_histogram(aes(x = x, y = ..density..), bins = 50)


Chapter 2 - Structure of Mixture Models and Parameter Estimation

Structure of Mixture Models:

  • Three questions to be answered for clustering with mixture models
    • Suitable probability distribution - depends on domain expertise
    • Number of clusters - domain expertise or testing to see what best satisfies criteria
    • Parameters and estimates - based on Expectation Maximization (EM) Algorithm
  • Example of the gender dataset - bivariate Gaussian, with two clusters, and mu/sigma/proportion for each
  • Example of handwritten digits (3 vs. 6) - Bernoulli distributions with two clusters, and a mean probability of each pixel being 1
  • Example of crime types in Chicago - Poisson distribution with six clusters (crime types), average and proportion of crimes

Parameter Estimation:

  • Suppose that we have an assumption of 2 clusters each from a Gaussian distribution and with each distribution having the same sigma
    • If the probabilities are known, try to estimate the means
    • If the means are known, try to estimate the probabilities
    • means_estimates <- data_with_probs %>% summarise(mean_red = sum(x * prob_red) / sum(prob_red), mean_blue = sum(x * prob_blue) / sum(prob_blue))
    • proportions_estimates <- data_with_probs %>% summarise(proportion_red = mean(prob_red), proportion_blue = 1 - proportion_red)
    • data %>% mutate(prob_from_red = 0.3 * dnorm(x, mean = 3), prob_from_blue = 0.7 * dnorm(x,mean = 5), prob_red = prob_from_red/(prob_from_red + prob_from_blue), prob_blue = prob_from_blue/(prob_from_red + prob_from_blue)) %>%
    • select(x, prob_red, prob_blue) %>% head()
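The weighted-mean and proportion estimates can be made self-contained with toy data; the data_with_probs values below are illustrative, not from the course:

```r
library(dplyr)

# Toy data: five observations with assumed membership probabilities
data_with_probs <- tibble::tibble(
    x         = c(2.8, 3.1, 4.9, 5.2, 5.0),
    prob_red  = c(0.9, 0.8, 0.2, 0.1, 0.15),
    prob_blue = 1 - prob_red
)

# Probability-weighted means: each point contributes to a cluster's mean
# in proportion to its membership probability
means_estimates <- data_with_probs %>%
    summarise(mean_red  = sum(x * prob_red)  / sum(prob_red),
              mean_blue = sum(x * prob_blue) / sum(prob_blue))

# Mixing proportions: the average membership probability per cluster
proportions_estimates <- data_with_probs %>%
    summarise(proportion_red  = mean(prob_red),
              proportion_blue = 1 - proportion_red)
```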

EM Algorithm:

  • Can begin by making naïve assumptions about the distributions, for later refinement
    • means_init <- c(1, 2)
    • props_init <- c(0.5, 0.5)
  • Can then run a first iteration of the probabilities - the expectations
    • means_estimates <- data_with_probs %>% summarise(mean_red = sum(x * prob_red) / sum(prob_red), mean_blue = sum(x * prob_blue) / sum(prob_blue)) %>% as.numeric()
    • props_estimates <- data_with_probs %>% summarise(proportion_red = mean(prob_red), proportion_blue = 1- proportion_red) %>% as.numeric()
  • Basically, the iterations continue
    • Iteration 0: Initial Parameters -> Estimate Probabilities (1)
    • Iteration 1: Estimated Probabilities (1) -> Estimated Parameters (2) -> Estimated Probabilities (2)
    • Iteration 2: Estimated Probabilities (2) -> Estimated Parameters (3) -> Estimated Probabilities (3)
    • Etc.
  • Can translate the iterative process in to a function that is called by way of a for loop
    • expectation <- function(data, means, proportions){
    • data <- data %>%
    •   mutate(prob_from_red = proportions[1] * dnorm(x, mean = means[1]),  
    •          prob_from_blue = proportions[2] * dnorm(x, mean = means[2]),  
    •          prob_red = prob_from_red/(prob_from_red + prob_from_blue),  
    •          prob_blue = prob_from_blue/(prob_from_red + prob_from_blue)
    •          ) %>%  
    •   select(x, prob_red, prob_blue)  
    • return(data)
    • }
    • for(i in 1:10){
    • new_values <- maximization(expectation(data, means_init, props_init))
    • means_init <- new_values[[1]]
    • props_init <- new_values[[2]]
    • cat(c(i, means_init, props_init), "")
    • }
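The loop above relies on a maximization() function that the course bullets do not show. Below is a minimal, runnable base-R sketch of the full EM cycle under illustrative assumptions (two unit-variance Gaussians; the simulated data and starting values are made up):

```r
set.seed(2019)

# Simulated mixture: 40% from N(3, 1), 60% from N(5, 1)
x <- c(rnorm(200, mean = 3), rnorm(300, mean = 5))

# E-step: membership probabilities given current means and proportions
expectation <- function(x, means, props) {
    d_red  <- props[1] * dnorm(x, mean = means[1])
    d_blue <- props[2] * dnorm(x, mean = means[2])
    cbind(prob_red  = d_red  / (d_red + d_blue),
          prob_blue = d_blue / (d_red + d_blue))
}

# M-step: probability-weighted means and average membership proportions
maximization <- function(x, probs) {
    list(means = c(sum(x * probs[, 1]) / sum(probs[, 1]),
                   sum(x * probs[, 2]) / sum(probs[, 2])),
         props = c(mean(probs[, 1]), mean(probs[, 2])))
}

# Iterate: naive starting values get refined toward the true parameters
means_init <- c(1, 2)
props_init <- c(0.5, 0.5)
for (i in 1:25) {
    probs      <- expectation(x, means_init, props_init)
    new_values <- maximization(x, probs)
    means_init <- new_values$means
    props_init <- new_values$props
}
round(means_init, 2)  # estimated component means (should approach 3 and 5)
round(props_init, 2)  # estimated mixing proportions (should approach 0.4 and 0.6)
```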

Example code includes:

digits <- readr::read_csv("./RInputFiles/digits.csv")
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
dim(digits)
## [1] 1593  266
digitData <- digits[, 1:256]
digitKey <- digits[, 257:266]

# Keep the subset of digits 4 and 8 (logical mask, not a 0/1 index)
digitUse <- rowSums(digitKey[, c(5, 9)] == 1) > 0
digData <- digitData[digitUse, ]
digKey <- digitKey[digitUse, ]

show_digit <- function(arr256, col = gray(4:1/4), ...) {
    arr256 <- as.numeric(arr256)
    image(matrix(arr256, nrow = 16)[, 16:1], col = col, ...)
}

# Dimension
# broom::glance(digits)

# Apply `glimpse` to the data
glimpse(digitData)
## Observations: 1,593
## Variables: 256
## $ V1   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V2   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V3   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ V4   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ V5   <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, ...
## $ V6   <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, ...
## $ V7   <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V8   <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V9   <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V10  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, ...
## $ V11  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, ...
## $ V12  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, ...
## $ V13  <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, ...
## $ V14  <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V15  <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V16  <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V17  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V18  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ V19  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, ...
## $ V20  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, ...
## $ V21  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, ...
## $ V22  <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, ...
## $ V23  <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, ...
## $ V24  <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, ...
## $ V25  <dbl> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, ...
## $ V26  <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, ...
## $ V27  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, ...
## $ V28  <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, ...
## $ V29  <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, ...
## $ V30  <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, ...
## $ V31  <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V32  <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V33  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ V34  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, ...
## $ V35  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, ...
## $ V36  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V37  <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, ...
## $ V38  <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, ...
## $ V39  <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ V40  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ V41  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ...
## $ V42  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ V43  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ V44  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, ...
## $ V45  <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, ...
## $ V46  <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, ...
## $ V47  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, ...
## $ V48  <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V49  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, ...
## $ V50  <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V51  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V52  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, ...
## $ V53  <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, ...
## $ V54  <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, ...
## $ V55  <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V56  <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V57  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V58  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V59  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V60  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ V61  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, ...
## $ V62  <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, ...
## $ V63  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, ...
## $ V64  <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V65  <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, ...
## $ V66  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V67  <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, ...
## $ V68  <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V69  <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
## $ V70  <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, ...
## $ V71  <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V72  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V73  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V74  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V75  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V76  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V77  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, ...
## $ V78  <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, ...
## $ V79  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, ...
## $ V80  <dbl> 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, ...
## $ V81  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V82  <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
## $ V83  <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, ...
## $ V84  <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, ...
## $ V85  <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V86  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, ...
## $ V87  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V88  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V89  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V90  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V91  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V92  <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V93  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V94  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, ...
## $ V95  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V96  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, ...
## $ V97  <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, ...
## $ V98  <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V99  <dbl> 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, ...
## $ V100 <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V101 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V102 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, ...
## $ V103 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V104 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V105 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V106 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V107 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V108 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V109 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V110 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, ...
## $ V111 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V112 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, ...
## $ V113 <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V114 <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V115 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, ...
## $ V116 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V117 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V118 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ V119 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ V120 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V121 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V122 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V123 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V124 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V125 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V126 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V127 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V128 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V129 <dbl> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V130 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V131 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ V132 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V133 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V134 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V135 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V136 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V137 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V138 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V139 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V140 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V141 <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V142 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V143 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V144 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V145 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V146 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V147 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ V148 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V149 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V150 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V151 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V152 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V153 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V154 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V155 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V156 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V157 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V158 <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V159 <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V160 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, ...
## $ V161 <dbl> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V162 <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ V163 <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...
## $ V164 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...
## $ V165 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...
## $ V166 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V167 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V168 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V169 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V170 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V171 <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V172 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V173 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V174 <dbl> 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, ...
## $ V175 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V176 <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, ...
## $ V177 <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...
## $ V178 <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ V179 <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, ...
## $ V180 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, ...
## $ V181 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ V182 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V183 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V184 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V185 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V186 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V187 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V188 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V189 <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, ...
## $ V190 <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V191 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V192 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ V193 <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...
## $ V194 <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...
## $ V195 <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, ...
## $ V196 <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, ...
## $ V197 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ V198 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V199 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V200 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V201 <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V202 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V203 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V204 <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, ...
## $ V205 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V206 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V207 <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, ...
## $ V208 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V209 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ V210 <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...
## $ V211 <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ V212 <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, ...
## $ V213 <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, ...
## $ V214 <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ V215 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V216 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V217 <dbl> 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, ...
## $ V218 <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, ...
## $ V219 <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, ...
## $ V220 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V221 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V222 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, ...
## $ V223 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ V224 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V225 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ V226 <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ V227 <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, ...
## $ V228 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...
## $ V229 <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V230 <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
## $ V231 <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V232 <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V233 <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V234 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ V235 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V236 <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, ...
## $ V237 <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ V238 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V239 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V240 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V241 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V242 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V243 <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ V244 <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ V245 <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, ...
## $ V246 <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...
## $ V247 <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V248 <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ V249 <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ V250 <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, ...
## $ V251 <dbl> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ V252 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ V253 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V254 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V255 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ V256 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
# Digit in row 50
show_digit(digitData[50, ])

# Digit in row 100
show_digit(digitData[100, ])

gaussian_sample_with_probs <- data.frame(
    x = c(54.5, 7.7, 55.9, 27.9, 4.6, 59.9, 6.4, 60.5, 32.6, 21.3, 0.5, 8.9, 70.7, 49.3, 40.1, 43, 8.1, 62.9, 56, 54.4, 42.5, 46.1, 58.3, 61.7, -11.6, 10.8, 27.5, 12.2, 67.7, -5.6, 13.3, 62.7, 37.2, 41.4, 47.4, 54.2, 31, 60.2, 69.9, 33.8, 25.4, 21.9, 17.9, 61.5, 49.8, 37.9, 55.8, 14.1, 53.3, 45.6, 44.7, 14.2, -5.7, 10.9, 63.7, -6.5, 50.3, 61.4, 35.1, -3.7, 68.4, -6.2, 64, 24.4, 65.7, 59.7, 52.7, 27.2, 17.5, 22.6, 14.7, 22.1, 61.5, 55.6, 62.6, 5.6, 52.3, 8, 25.4, 48.8, 58.4, 6.2, 52.3, 6.6, 64, 43, 60.6, 33.5, 45.8, 2.5, 63, 58.2, 50.9, 22.1, 36.5, 27.1, 61.4, 56.3, 63.5, 55.6, 53.8, 31.9, 30.7, 15.6, 14.8, 44.4, 51.9, 61.4, 11.8, 51.3, 58.6, 45.4, 8.3, 41.5, 52.7, 9.1, 60.8, 40.2, 20.5, 40.2, 59.2, 36.7, 47.5, 12.2, 7.7, 56.2, -13.2, 6, 58.7, 43.7, 67.3, 53.6, 37.6, 54.3, 37.7, 51.9, 10.5, 42, 24, -0.7, 53.1, 27.4, 57.2, 37.3, 28.6, 13.5, 35.2, 22.7, 35.8, 66.9, 45.9, 45.9, 56.7, 55.6, 58.3, 3.2, 45.9, 59.5, 50.8, 43.7, 42.8, 4.7, 29.5, 50.9, 7.8, 44.3, 53.6, 57, 57.8, 47.3, 56.8, 51.1, 27.7, 44.9, 33, 44, 42.1, 38, 52.3, 44, 28.1, 52.7, 53.6, 4.7, 42.1, 40.8, 5, 8, 49.1, 67.5, 16.2, 11.2, 14.6, 32.8, 61.3, 49.8, 51.5, 54.5, 51.6, 45.8, 55.9, 7.4, -10.2, 41.9, 27.4, 45.1, 17.7, 37.5, 53.5, 25.7, 18.1, 13.4, 40.5, 13.3, 2, 49.8, 66.7, 34.7, 11.4, 42.1, 54.4, 48.3, 38.3, 17.4, 48.2, 48.4, 57.4, 54.5, 13.6, 52.3, -0.1, 12.8, 29.3, 45.6, 62.3, 49.2, 32.6, 38.4, 15, 6.1, 12.2, 5.8, 17.7, 20.7, 43.6, 52.3, 42.4, 64.6, 34.3, 9.5, 3.6, 37.2, 45.7, 56.9, 67, 48.7, -3.1, 50.1, 45.4, 54.4, 38.1, 10.8, 7.4, 50.5, 24.7, 11.4, 59.5, 43.9, 4.4, 53.7, 41.9, 60.2, 49.5, 11.6, 51.1, 69.1, 46.2, 35.5, 15, -6.4, 59.9, 57.3, 49.1, 55.5, 55.6, 43.9, 52.5, 46.4, 5.8, 55.3, 22.2, 42.7, 51.3, 40.1, 62.1, 62.2, 48.8, 6.1, 0.6, 19.6, 36.8, 48, 33.8, 52.8, 66.6, 30.2, 45.9, 5.9, 52.7, 49.7, 37.7, 10.4, 60.1, 35.8, 62.1, 35, 38.7, 13.3, -4.9, 30.6, 55.9, 23.7, 12.6, 45.7, 38.1, 9.9, 39.6, 46.3, -3.5, 31.2, 8.3, -8.1, 31.4, 65.7, 10.7, 5.5, 54.4, 51.8, 59.8, 50.3, 45.1, 8.5, 15.3, 3.2, 19.3, 
40.8, 48.4, 30.1, 32.7, 12.7, 59.2, 51.4, 55.3, 58.9, -19, 61.9, 30.3, 77.2, 39.8, 31.3, 23.1, 56, 41.9, 0.5, 33.4, 36.6, 54.4, 12.4, 16.4, 24.4, -2.4, 30.9, 56.4, 12.5, 65.2, 10, -1.7, 45.7, 49.5, 45.3, 17.5, 29, -8.7, 51.7, 17.3, 20.2, 14.6, 47.6, 55.3, 50.2, 4.1, 47.5, 71, 13.2, 75.4, 6.2, 53, 54.2, 40.6, 55.1, 67.4, 45, 47.3, 44.2, 8.4, 46.1, 48.7, 8.3, 40.4, 63, 49, 2.8, 50.4, 17.7, 40.4, 41.1, 56.6, 37.3, -0.1, 62.5, 47.7, 62.1, 16.6, 33.3, 4.1, 61, 49.4, 44.1, 18.7, -1.3, 42.1, -11.8, 40.6, 45.6, 14.9, 51.9, 57.4, 41.3, 59.2, 58.6, 50.5, -3.9, -0.6, 11.5, 54.5, 57.1, 46.2, 51.9, 58.2, 51.6, 50.3, 64.2, 8.3, 49, 42, 43.7, 53.4, 6.5, 36.6, -18.2, 41.8, -6.8, 35, 46.8, 43.8, 60.6, -11.3, 18.5, 0.3, 40.2, 73.3, 58.2, 43.9, 22.2, 12.8, 6.7, 36.3, 51.8, 33.6, 71, 56.8, 26, 43.3, 37.4, 60, 17.2, -10.3, 43.9, 69, 38.7, 57.9, 40.2, 48.6, 57.7, 45.8, 56.2, 7.3, 32.1, 41.2, 39.1), 
    prob_cluster1=c(0, 1, 0, 0.552, 1, 0, 1, 0, 0.158, 0.947, 1, 1, 0, 0, 0.01, 0.003, 1, 0, 0, 0, 0.004, 0.001, 0, 0, 1, 0.999, 0.591, 0.999, 0, 1, 0.998, 0, 0.03, 0.006, 0.001, 0, 0.268, 0, 0, 0.107, 0.773, 0.933, 0.985, 0, 0, 0.023, 0, 0.997, 0, 0.001, 0.002, 0.997, 1, 0.999, 0, 1, 0, 0, 0.065, 1, 0, 1, 0, 0.834, 0, 0, 0, 0.626, 0.988, 0.912, 0.996, 0.928, 0, 0, 0, 1, 0, 1, 0.773, 0, 0, 1, 0, 1, 0, 0.003, 0, 0.118, 0.001, 1, 0, 0, 0, 0.926, 0.038, 0.631, 0, 0, 0, 0, 0, 0.201, 0.286, 0.994, 0.996, 0.002, 0, 0, 0.999, 0, 0, 0.001, 1, 0.005, 0, 1, 0, 0.009, 0.961, 0.009, 0, 0.036, 0, 0.999, 1, 0, 1, 1, 0, 0.002, 0, 0, 0.025, 0, 0.024, 0, 0.999, 0.004, 0.855, 1, 0, 0.604, 0, 0.028, 0.484, 0.997, 0.062, 0.909, 0.05, 0, 0.001, 0.001, 0, 0, 0, 1, 0.001, 0, 0, 0.002, 0.003, 1, 0.398, 0, 1, 0.002, 0, 0, 0, 0.001, 0, 0, 0.576, 0.001, 0.138, 0.002, 0.004, 0.022, 0, 0.002, 0.533, 0, 0, 1, 0.004, 0.007, 1, 1, 0, 0, 0.993, 0.999, 0.996, 0.147, 0, 0, 0, 0, 0, 0.001, 0, 1, 1, 0.005, 0.606, 0.001, 0.987, 0.026, 0, 0.754, 0.985, 0.998, 0.008, 0.998, 1, 0, 0, 0.076, 0.999, 0.004, 0, 0, 0.019, 0.988, 0, 0, 0, 0, 0.997, 0, 1, 0.998, 0.417, 0.001, 0, 0, 0.162, 0.018, 0.995, 1, 0.998, 1, 0.987, 0.956, 0.002, 0, 0.004, 0, 0.087, 0.999, 1, 0.03, 0.001, 0, 0, 0, 1, 0, 0.001, 0, 0.021, 0.999, 1, 0, 0.82, 0.999, 0, 0.002, 1, 0, 0.005, 0, 0, 0.999, 0, 0, 0.001, 0.057, 0.995, 1, 0, 0, 0, 0, 0, 0.002, 0, 0.001, 1, 0, 0.925, 0.003, 0, 0.009, 0, 0, 0, 1, 1, 0.971, 0.035, 0, 0.104, 0, 0, 0.329, 0.001, 1, 0, 0, 0.024, 0.999, 0, 0.05, 0, 0.067, 0.016, 0.998, 1, 0.298, 0, 0.87, 0.998, 0.001, 0.021, 0.999, 0.012, 0.001, 1, 0.247, 1, 1, 0.233, 0, 0.999, 1, 0, 0, 0, 0, 0.001, 1, 0.995, 1, 0.975, 0.007, 0, 0.34, 0.157, 0.998, 0, 0, 0, 0, 1, 0, 0.325, 0, 0.011, 0.246, 0.895, 0, 0.005, 1, 0.12, 0.037, 0, 0.998, 0.992, 0.834, 1, 0.273, 0, 0.998, 0, 0.999, 1, 0.001, 0, 0.001, 0.987, 0.446, 1, 0, 0.989, 0.964, 0.996, 0, 0, 0, 1, 0, 0, 0.998, 0, 1, 0, 0, 0.008, 0, 0, 0.001, 0.001, 0.002, 1, 0.001, 0, 1, 
0.008, 0, 0, 1, 0, 0.986, 0.008, 0.006, 0, 0.028, 1, 0, 0, 0, 0.991, 0.128, 1, 0, 0, 0.002, 0.98, 1, 0.004, 1, 0.008, 0.001, 0.996, 0, 0, 0.006, 0, 0, 0, 1, 1, 0.999, 0, 0, 0.001, 0, 0, 0, 0, 0, 1, 0, 0.004, 0.002, 0, 1, 0.037, 1, 0.005, 1, 0.068, 0.001, 0.002, 0, 1, 0.981, 1, 0.009, 0, 0, 0.002, 0.925, 0.998, 1, 0.042, 0, 0.114, 0, 0, 0.725, 0.003, 0.027, 0, 0.989, 1, 0.002, 0, 0.016, 0, 0.009, 0, 0, 0.001, 0, 1, 0.189, 0.006, 0.014)
)

gaussian_sample_with_probs <- gaussian_sample_with_probs %>%
    mutate(prob_cluster2 = 1-prob_cluster1)
glimpse(gaussian_sample_with_probs)
## Observations: 500
## Variables: 3
## $ x             <dbl> 54.5, 7.7, 55.9, 27.9, 4.6, 59.9, 6.4, 60.5, 32....
## $ prob_cluster1 <dbl> 0.000, 1.000, 0.000, 0.552, 1.000, 0.000, 1.000,...
## $ prob_cluster2 <dbl> 1.000, 0.000, 1.000, 0.448, 0.000, 1.000, 0.000,...
# Estimation of the means
means_estimates <- gaussian_sample_with_probs %>% 
    summarise(mean_cluster1= sum(x*prob_cluster1)/sum(prob_cluster1),
              mean_cluster2 = sum(x*prob_cluster2)/sum(prob_cluster2)
              )
means_estimates
##   mean_cluster1 mean_cluster2
## 1      10.39535      49.46501
# Estimation of the proportions
props_estimates <- gaussian_sample_with_probs %>% 
    summarise(props_cluster1 = mean(prob_cluster1),
              props_cluster2 = mean(prob_cluster2)
              )
props_estimates
##   props_cluster1 props_cluster2
## 1        0.33148        0.66852
# Transform to a vector
means_estimates <- as.numeric(means_estimates)

# Plot histogram with means estimates
ggplot(gaussian_sample_with_probs) + geom_histogram(aes(x = x), bins = 100) +
    geom_vline(xintercept = means_estimates)

gaussian_sample <- data.frame(
    x=c(6.4, 5.9, 57.8, 52.6, 54.3, 52.3, 4.4, 49.1, -4, 12.7, 19.8, 51.8, 35.4, 17.1, 38.8, 44.1, 45.6, 7.9, 57.7, 51.1, 14.1, 36.6, 51.6, 4.1, -1.8, 55.1, 52.4, 54.4, 47.9, 36.6, 53.9, 15, 68.8, 8.3, 40.8, 39.3, 37.1, 12.7, 54.6, 34.1, 24.9, 58.5, 50.8, 48.6, 60, 52.1, 61.5, 6.9, 63, 63.5, 54.1, 37.7, 52.6, 49.1, 53.7, 13.4, 23.6, 45.5, 33.4, 46.4, 46.6, 56.1, 37.8, 44.1, 62.4, 12, 54.4, 31.6, -1, 9.4, 16, 53.4, 71.1, 8.9, 64.4, 55.9, 50.5, 57.2, 45.9, 18.5, 53.9, 12.5, 12.2, 1.5, 0.3, 40.1, 13.9, 53.2, 12.1, 57.2, 2.3, -2.6, 2.7, 59.6, 3, 10.3, 66.9, 57.3, 57.6, 9.1, 43.8, 51.1, 7.7, 13.4, 46.3, 57.5, 0.2, 1.9, 43.8, 53.9, 9.3, 45.5, 15.4, -3.2, -1.2, 40.5, 1.9, 14.5, -2, 3.4, 54.1, 2.9, 58.2, 49.5, 49.1, 60.2, 45.3, 59.7, 38, 22.4, 42.6, 53.6, 7.3, 43.9, 2.8, 66.5, 56.5, 44.4, 53.5, 40.6, 57.1, 43.8, -3.1, 47.3, 42.5, 50.8, -12, -12, 15.2, 43.8, 57.3, 32.2, 61.1, 15.1, 5.8, 24.7, 51.5, 7.7, -5.1, 63.1, 50.1, 39.9, 38.7, -5.2, 50.3, 49.1, 58.1, 31.3, 54.6, 39.1, 4.4, 60.5, 45.6, 59.7, 39.5, 60.6, 42.8, 49.5, 12.9, 47.2, 50, 11.4, 50.9, 57.3, 46.7, 35.6, 38.8, 56, -5.7, 50.5, 21.2, 45.9, 60.7, 22.1, 46.7, 12.5, 55.2, 48.4, 36.6, 54, 47, 50.3, 51.7, 11, 56, 42.4, 61.8, 45.6, 60.5, 40.6, 8.8, 21, 5.6, 68.2, 21.3, 11.5, 47.2, 26.4, 35.8, 25.4, 19.6, 56, 9.1, 63.4, 48.5, 3.2, 57.1, 52.7, 11.3, 16.3, 49, 46.5, 12.4, 9.6, 45.5, 55.3, 72.9, 8.1, -3.8, 53.8, 34.1, 45.7, 56.3, 44, 23.4, 57.2, 0.5, 33.2, 63.4, 37.3, 57.3, 52.7, 9.7, 51.9, 39.4, 63.7, 23.3, 39.9, -0.5, 41.6, 11.3, 48, 38.2, 54.2, 41.3, 30.6, 55.2, 48.9, 34.4, 16.2, 45.7, 10.1, 42.7, 12.2, 39.5, 14.1, 64.9, 53.1, 50.4, 47, 58.5, 50.8, 43.9, 56.8, 12.6, 44.5, 54.6, 8.9, 15.5, 50.2, 4.8, 52.8, 14.4, 33.7, 5.4, -0.2, 19.8, 51, 59.4, -8.2, 10.4, 47.8, 31.2, 41.4, 9.4, -3.2, 21.1, 44.7, 22.9, 11.5, 49.6, 26.7, 11.5, 35.2, 9.4, 44.8, 63.1, 8.5, 21, 30.9, 16.1, 54.4, 53.4, 9.7, 49.8, 45.6, -3, 53, 43.4, 43.4, 43.9, 56.6, 33.5, 55.1, 54.4, 62.8, 37.9, 35.1, 8.6, 7.1, 46.1, 6.1, 27, -9.9, 6.4, 44.6, 49, 46, 42.4, 
9.5, 47.1, 51.3, -4.7, 14, 64.8, 38, 33.6, -0.4, 53.5, 40.3, 47.2, 58.5, 45.4, 2.5, 52.9, 47.4, 56.1, 17.7, 3.9, 30.7, 44.6, 42.4, 55.4, 47.1, 11.5, 50.7, 47.6, 11.3, 45.1, 44.2, 46.6, 36.9, 47.4, 54.6, -2, 50.7, 63.6, 58.9, 7.6, -3.1, 31.1, 44.9, 55.7, 16.6, 64.3, 27.1, 23, 48.7, -0.8, 23.6, 72.8, 11.9, 57.3, 25.4, 47.1, 9.4, 57.6, 39.6, 25.3, 31.2, 52.4, 51.1, 1.6, 76.5, 50.7, 34.2, 7.6, 25.4, 11.7, 53.5, 17.5, 53.7, 61.2, 49.9, 48.8, 40.8, 61.2, 16.4, 48.6, 7.5, -2, 64.2, 26.2, 11.2, 3.2, -4.3, 37.9, 47.7, 26.3, 58, 66.9, 59.1, 35.8, 14.2, 53, 60.3, 63.3, 53.6, 47.6, 57.1, 37, 47.6, 61.6, 52.7, 0.8, 50.5, 48.1, -3.4, 53.6, 35.7, 49.8, 2.7, 59.9, 36.5, 63.6, 53.3, 3.8, 20.2, 19.7, 20.7, 45.6, 39.8, 37.2, 38.6, 12.4, 56.3, 59.6, 10.5, 11, -6.8, 58.8, 49.5, -3.6, 51.1, 53.1, 46, 57.9, 15.2, -2.3, 22.9, 32.8, 37.6, 52, 77.5, 2.2, 9.5, 40.4, 48.5, 27.2, 37.4)
)
str(gaussian_sample)
## 'data.frame':    500 obs. of  1 variable:
##  $ x: num  6.4 5.9 57.8 52.6 54.3 52.3 4.4 49.1 -4 12.7 ...
# Create data frame with probabilities
gaussian_sample_with_probs <- gaussian_sample %>% 
    mutate(prob_from_cluster1 = 0.35 * dnorm(x, mean = 10, sd = 10),
           prob_from_cluster2 = 0.65 * dnorm(x, mean = 50, sd = 10),
           prob_cluster1 = prob_from_cluster1/(prob_from_cluster1 + prob_from_cluster2),
           prob_cluster2 = prob_from_cluster2/(prob_from_cluster1 + prob_from_cluster2)) %>%
    select(x, prob_cluster1, prob_cluster2) 
head(gaussian_sample_with_probs)
##      x prob_cluster1 prob_cluster2
## 1  6.4  9.998524e-01  0.0001475847
## 2  5.9  9.998792e-01  0.0001208354
## 3 57.8  7.976210e-06  0.9999920238
## 4 52.6  6.384176e-05  0.9999361582
## 5 54.3  3.234434e-05  0.9999676557
## 6 52.3  7.198080e-05  0.9999280192
expectation <- function(data, means, proportions, sds){
  # Estimate the probabilities
  exp_data <- data %>% 
      mutate(prob_from_cluster1 = proportions[1] * dnorm(x, mean = means[1], sd = sds[1]),
             prob_from_cluster2 = proportions[2] * dnorm(x, mean = means[2], sd = sds[2]),
             prob_cluster1 = prob_from_cluster1/(prob_from_cluster1 + prob_from_cluster2),
             prob_cluster2 = prob_from_cluster2/(prob_from_cluster1 + prob_from_cluster2)) %>%
      select(x, prob_cluster1, prob_cluster2)
    # Return data with probabilities
  return(exp_data)
}

maximization <- function(data_with_probs){
    means_estimates <- data_with_probs %>%
        summarise(mean_1 = sum(x * prob_cluster1) / sum(prob_cluster1),
                  mean_2 = sum(x * prob_cluster2) / sum(prob_cluster2)
                  ) %>% 
        as.numeric()
    props_estimates <- data_with_probs %>% 
        summarise(proportion_1 = mean(prob_cluster1), proportion_2 = 1 - proportion_1) %>% 
        as.numeric()
    list(means_estimates, props_estimates)   
}


means_init <- c(0, 100)
props_init <- c(0.5, 0.5)

# Iterative process
for(i in 1:10){
    new_values <- maximization(expectation(gaussian_sample, means_init, props_init, c(10, 10)))
    means_init <- new_values[[1]]
    props_init <- new_values[[2]]
    cat(c(i, means_init, props_init), "\n")
}
## 1 25.28863 56.90005 0.6797875 0.3202125 
## 2 20.01129 53.44814 0.539439 0.460561 
## 3 14.77156 51.48322 0.4377961 0.5622039 
## 4 11.62146 50.28191 0.3846544 0.6153456 
## 5 10.34764 49.72052 0.363436 0.636564 
## 6 9.918957 49.49888 0.355935 0.644065 
## 7 9.777305 49.41932 0.3533705 0.6466295 
## 8 9.730017 49.39192 0.3525025 0.6474975 
## 9 9.714139 49.38262 0.3522096 0.6477904 
## 10 9.708796 49.37948 0.3521109 0.6478891
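The fixed 10-iteration loop above can instead stop once the estimates settle. A sketch with a tolerance-based stopping rule (the `1e-6` threshold and the 100-iteration cap are arbitrary choices, not from the course):

```r
# Run EM until the means move by less than a tolerance (or hit the iteration cap)
means_current <- c(0, 100)
props_current <- c(0.5, 0.5)
for (i in 1:100) {
    new_values <- maximization(expectation(gaussian_sample, means_current, props_current, c(10, 10)))
    change <- max(abs(new_values[[1]] - means_current))
    means_current <- new_values[[1]]
    props_current <- new_values[[2]]
    if (change < 1e-6) break  # estimates have converged
}
cat("Stopped after", i, "iterations\n")
```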
fun_gaussian <- function(x, mean, proportion){
    proportion * dnorm(x, mean, sd = 10)
}

means_iter10 <- means_init
props_iter10 <- props_init

gaussian_sample %>% ggplot() + 
    geom_histogram(aes(x = x, y = ..density..), bins = 200) +
    stat_function(geom = "line", fun = fun_gaussian, 
                  args = list(mean = means_iter10[1], proportion = props_iter10[1])
                  ) +
    stat_function(geom = "line", fun = fun_gaussian,
                  args = list(mean = means_iter10[2], proportion = props_iter10[2])
                  )


Chapter 3 - Mixture of Gaussians with flexmix

Univariate Gaussian Mixture Models:

  • Gaussian mixture models are formed by Gaussian distributions, which can have many potential parameters
    • Simplest version is the univariate Gaussian, such as the BMI dataset with no labels for gender
  • Example of using the gender dataset for segmentation, assuming that the data are not labeled by gender
    • Weight should be a Gaussian - continuous, not linked to integer values, etc.
    • Histogram looks like two Gaussians, so begin with the assumption of 2 univariate Gaussians with a resulting 2 segments
    • Parameters can initially be estimated with a baseline mean, sd, and proportion (6 total parameters)
    • The EM algorithm, implemented in flexmix, can then help to simplify the calculations
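The six parameters above (two means, two standard deviations, two proportions) fully define the mixture density. A sketch with arbitrary illustrative values (not fitted estimates):

```r
library(ggplot2)

# Two-component univariate Gaussian mixture with arbitrary parameters
mixture_density <- function(x) {
    0.35 * dnorm(x, mean = 10, sd = 10) +  # proportion 1 * component 1 density
    0.65 * dnorm(x, mean = 50, sd = 10)    # proportion 2 * component 2 density
}

ggplot(data.frame(x = c(-30, 90)), aes(x = x)) +
    stat_function(fun = mixture_density)
```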

Univariate Gaussian Mixture Models with flexmix:

  • Can begin by checking what the most suitable distributions might be
    • gender %>% ggplot(aes(x = Weight)) + geom_histogram(bins = 100)
    • Univariate Gaussian with 2 clusters
  • Can use flexmix::flexmix(formula, data, k, model, control, …) to make the estimates
    • formula - describes the model to be fit (often variable ~ 1)
    • data - data frame
    • k - number of clusters
    • model - distribution to be considered, such as FLXMCnorm1 for the univariate Gaussian
    • control - maximum iterations and tolerance
  • Example of fitting to a dataset using flexmix
    • fit_mixture <- flexmix(Weight ~ 1, data = gender, k = 2, model = FLXMCnorm1(), control = list(tol = 1e-15, verbose = 1, iter = 1e4))
    • proportions <- prior(fit_mixture) # proportions from the model
    • parameters(fit_mixture) # all the parameters
    • comp_1 <- parameters(fit_mixture, component = 1) # just the parameters for component 1
    • posterior(fit_mixture) %>% head() # probability of belonging to each cluster by observation
    • clusters(fit_mixture) %>% head() # assignment to the cluster with maximum probability
    • table(gender$Gender, clusters(fit_mixture))
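The bullets above can be consolidated into a single runnable sketch, assuming a `gender` data frame with Weight and Gender columns as in the course:

```r
library(flexmix)

set.seed(1515)
fit_mixture <- flexmix(Weight ~ 1, data = gender, k = 2, model = FLXMCnorm1(),
                       control = list(tolerance = 1e-15, iter = 1e4))

prior(fit_mixture)                           # estimated proportions
parameters(fit_mixture)                      # means and sds for both components
head(posterior(fit_mixture))                 # per-observation cluster probabilities
table(gender$Gender, clusters(fit_mixture))  # compare assignments to the true labels
```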

Bivariate Gaussian Mixture Models with flexmix:

  • May want to use additional variables to improve the clustering - for example, using both height and weight from the gender dataset
    • Bivariate Gaussian with 2 clusters, and needing to estimate additional means and standard deviations (which need to also be between the variables)
  • The flexmix library implements the bivariate Gaussian distribution, which is conceptually like the univariate case extended to two variables
    • There would be two means - one for variable 1 and one for variable 2
    • There would be a 2x2 covariance matrix - var1 and var2 on the diagonal, with cov(1, 2) in both off-diagonal positions - really, three distinct terms since the off-diagonals are equal
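The 2x2 covariance matrix described above can be written out directly. A minimal sketch with arbitrary values, drawing from the corresponding bivariate Gaussian via MASS::mvrnorm:

```r
library(MASS)

means <- c(70, 25)                  # mean of variable 1 and variable 2
sigma <- matrix(c(100, 30,          # var1,      cov(1, 2)
                   30, 16),         # cov(1, 2), var2
                nrow = 2)

set.seed(1)
sample_biv <- mvrnorm(n = 500, mu = means, Sigma = sigma)
cov(sample_biv)  # sample covariance should be close to sigma
```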

Bivariate Gaussian Mixture Models with flexmix (continued):

  • Example of a covariance matrix without cross-terms
    • fit_without_cov <- flexmix(cbind(Weight, BMI) ~ 1, k = 2, data = gender, model = FLXMCmvnorm(diag = TRUE), control = list(tolerance = 1e-15, iter.max = 1000)) # cbind() is used because there are two variables; diag = TRUE restricts the covariance matrix to its diagonal (no cross-terms)
    • proportions <- prior(fit_without_cov)
    • parameters(fit_without_cov)
    • comp_1 <- parameters(fit_without_cov, component = 1)
    • comp_2 <- parameters(fit_without_cov, component = 2)
    • mean_comp_1 <- comp_1[1:2]
    • mean_comp_2 <- comp_2[1:2]
  • Can then visualize the resulting clusters
    • library(ellipse)
    • ellipse_comp_1 <- ellipse(x = covariance_comp_1, centre = mean_comp_1, npoints = nrow(gender))
    • ellipse_comp_2 <- ellipse(x = covariance_comp_2, centre = mean_comp_2, npoints = nrow(gender))
    • gender %>%
    •   ggplot(aes(x = Weight, y = BMI)) + geom_point() +
    •   geom_path(data = data.frame(ellipse_comp_1), aes(x = x, y = y), col = "red") +
    •   geom_path(data = data.frame(ellipse_comp_2), aes(x = x, y = y), col = "blue")
  • Need to include joint variability to improve the modeling
    • fit_with_cov <- flexmix(cbind(Weight, BMI) ~ 1, k = 2, data = gender, model = FLXMCmvnorm(diag = FALSE), control = list(tolerance = 1e-15, iter.max = 1000)) # cbind() is used because there are two variables; diag = FALSE allows a full covariance matrix, including the cross-terms
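The `covariance_comp_1` and `covariance_comp_2` objects passed to ellipse() above are not defined in the bullets. With FLXMCmvnorm, parameters() returns the two means followed by the four covariance entries, so one way to build them (a sketch, assuming the fit objects above):

```r
comp_1 <- parameters(fit_with_cov, component = 1)
comp_2 <- parameters(fit_with_cov, component = 2)

# Entries 1:2 are the means; entries 3:6 are the 2x2 covariance matrix, column-wise
covariance_comp_1 <- matrix(comp_1[3:6], nrow = 2)
covariance_comp_2 <- matrix(comp_2[3:6], nrow = 2)
```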

Example code includes:

xExample <- c(7.3, 58.7, 9.7, 16.9, 6.3, 35.1, 33.5, 61.3, 28.3, 24.3, 58.6, 13.1, 58.7, 34, 29.1, 46.4, 54.6, 5.9, 30.6, 27.9, 27.5, -5.3, 37.6, 9.1, 44.5, 57.5, 30.5, 5, 51.9, 33.6, 37.4, 28.8, 47.9, 5.4, 64.1, 45.1, 41, 36.3, 28.2, 33.8, 9.8, 57.4, 48.4, 58.3, 27.7, 38.4, 36.4, 66.9, 30.7, 34.3, 25.9, 48.5, 52, 0.3, 45.3, 31.9, 21.6, 36.6, 29, 13.2, 41.5, 8.2, 46.6, 30.6, 48.6, 5.6, 39.3, 30.5, 34.2, 61.5, 4.2, 71.3, 42.5, 32.7, 54.4, 19.2, 13.3, 40.3, 72, 21.8, 49.5, 38.7, 9.6, 49.6, 32, 30.9, 28.6, 30.1, 29.8, 67.9, 60.8, 55, 34.6, 32.8, 11.9, 50.5, 32.1, 13.7, 48.6, 32.6, 9.1, 27.6, 35.6, 28.3, 15.1, 54.7, 30.8, 22.2, 27.5, 49.3, 56, 26.1, 57.2, 46.4, 50.3, 43.6, 51.8, 47.5, 15.5, 60.2, 63.6, 45.3, 14.1, 42.1, 31.4, 42.4, 61.7, 60.1, 27.7, 55.9, 3.3, 18.7, 58.1, 46, 14, 41.7, 28.9, 29.1, 56.9, 32.3, -0.8, 29.4, 27.3, 33.5, 39.1, 13.9, 28.7, 29.4, 10.3, 44.3, 57.1, 76, 49.4, 44.9, 23.2, 53.9, 33.6, 32.7, 30, 57, 63.6, 32.9, 8.6, 26.5, 26, 53.3, 40.8, 30.1, 10.5, 47.2, 30.2, 49.3, 52.4, 48.8, 51.4, 40.7, 33.8, 45.7, 28.1, 13.2, 28.4, 31.7, 30, 29.6, 49.5, 35, 62, 51.9, 39, 15.4, 59.1, 54.8, 9.2, 9.7, 35.4, 32.9, 31.3, 30.4, 64.4, 63.4, 32.9, 40.6, 37.5, 52.3, 35.3, 8.1, 6.4, 26.2, 29.2, 29.7, 27.8, 35.2, 34.1, 29.8, 49, 65.6, -1.1, 28.6, 33.7, 48.1, 45.7, 30.3, 32.7, 64.5, 29.8, 52.5, 48.4, 48.8, 26.4, 37.4, 33.2, 46.1, 29.5, -0.9, 49.8, 34.1, 48.9, 12.5, 36.6, 22.1, 57.3, 9.5, 9.4, 58.5, 50.2, 45.3, 25.3, 27.4, 4.5, 58.5, 63.4, 48.7, 42.6, 33, 47.9, 30.3, 54.9, 7.9, 50.2, 11.2, 59.7, 46.5, 57.5, 26.9, 28.5, 29.7, 52.5, 16.9, 29.8, 28.6, 31.2, 65.3, 1.7, 31.4, 52.5, 5.1, 66.1, 51.5, 9.5, 9.8, 41.6, 0.3, 10.4, 15.5, 34.8, 27.5, 43.6, 31.4, 46.3, 4.6, 45.8, 49.2, 10.7, 48.1, 7.3, 33.4, 10.7, 53.4, 28.9, 51.1, 52.4, 55.9, 56.8, 47.2, 46.8, 30.8, 60.3, 53.6, 30.9, 70.8, 11.2, 7.5, 55.8, 14.3, 25.8, 14.5, 30.9, 60.8, 26.8, 16.5, 31.4, 26.6, 10.6, 53.4, 33.1, 33.1, 46.3, 8.2, 56, 14.1, 25.5, 59.6, 61.9, 58.6, 63.1, 47.7, 30.5, 42.4, 56.2, 17, 13.4, 34.4, 1.1, 18.4, 
63.9, 38.6, 15, 30.1, 23.9, 5.9, 53.8, 18.2, 22.7, 45.7, 29.2, 8.4, 52.5, 42, 28.7, 61.7, 35.4, 32.5, 5.5, 6.8, 60.1, 29.4, 31.5, 2.3, 28.3, 29.6, 34.9, 33.2, 28.9, 33.9, 51, 35.4, 52.3, 60, 27.1, 24.7, 57.7, 32.7, 52.5, 66.3, 37.8, 46.3, 38.1, 30.6, 55.6, 44.9, 28.4, 28.9, 19, 7.7, 9.4, 36, 49.9, 42.2, 28.2, 11.5, 52.4, 46.3, 52.4, 27.4, 15.6, 62.3, 51.7, 41.6, 6.2, 10.5, 14.7, 30.4, 23.9, 58.7, 36.1, 47.6, 31.2, 29.1, 60.1, 18, 30, 56.5, 42.7, 27.1, 45.5, 36.6, 46.4, 25.9, 15.4, 31.6, 3.3, 33.6, 63.3, 57.1, 32.3, 11.8, 32.9, 47.2, 31.2, 49.3, 61.7, 11.5, 9.7, 49.6, 45.7, 16.1, 27.4, 22.8, 8.5, 56.2, 26, 45.7, 29, 34.6, 29.4, 3.9, 45.7, 31.7, 52.6, 40.2, 35.5, 5.8, 56.4, 49.5, 30.6, 40.2, 20.8, 43.9, 32.1, 40.8, 45.6, 32.8, 7.4, 27.5, 29.4, 50.8, 43.9, 36.8, 5.5, 61.5, 41.5, 47.5, 13.9, 30.1, 67.3, 27.1, 50.8, 37.4, 28, 25, 37.1, 49.3, 25.3, 26.9, 34.9, 51.8, 33.9, 34.7, 44.2, 10.1, 71.3, 47.5, 23.4, 45.7, 49.4, 32.6, 6.9, 67.8, 56.8, 41.9, 50.7, 31.5, 55, 14.2, 34.8, 26.2, 25.8, 64, 63.8, 56.4, 42.1, 29.5, 49.4, 30.2, 16.2, 30, -0.2, 30.7, 29.6, 57, 41.5, 6.4, 9.7, 47.1, 19.4, 39.8)
xExample <- c(xExample, 32.9, 53.6, 8.4, 32.8, 63.1, 58.4, 7.5, 26, 41.8, 29, 36.9, 41.5, 39.5, 14.1, 27.4, 14.9, 48.4, 34.8, 72.8, 36.9, 27.8, 27.6, 6.1, 43.8, 36.9, 58.5, 55.1, 45.2, 2.6, 20.4, 59, 60.6, 57.7, 29.8, 60.2, 36.9, 29, 28, 46.5, 55, 29.6, 52.6, 38, 45.3, 5.7, 44.8, 35.3, 56.1, 30.3, 32.4, 56.9, 30.8, 44.8, 62.8, 46.1, 57.2, 50.5, 46.4, 37.6, 29.9, 8.6, 35.5, 47.4, 27.2, 36.4, 33.1, 29.4, 25.8, 46, 27.6, 45.7, 32.3, 12.8, 49.8, 13.7, 65.3, 48.5, 39.6, 4, 32.1, 49.6, 44, 74.5, 31, 52.6, 33.3, 56.8, 11.4, 33.7, 34.3, 25.8, 39.8, 7.3, 33.6, 7.9, 49.6, 52.6, 36.5, 43, 14.7, 43.5, 37, 50.8, 46.5, 46.9, 25.4, 32.7, 48.4, 40.3, 45.9, 51.3, 24, 48.3, 39.5, 21.2, 48.1, 56.9, 32.3, 10.2, 9.3, 40.3, 52.8, 34.5, 32.4, 30.1, 10.8, -3.8, 30.4, 58.2, 57.3, 48.9, 36.1, 46.2, 69, 67.8, 58.5, 41.9, 29.6, 51.7, 39.4, 50.8, 29.2, 56.1, 54.4, 17.2, 57.5, 54.1, 48.6, -0.9, 56.3, 27.7, 58.8, 57, 44.1, 6.3, 4.1, 35.9, 60.2, 44.1, 53.9, 33.3, 35.4, 32.1, 56, 56.8, 30.1, 43.1, 64.6, 27.7, 30.7, 53, 66, 29.1, 45, 12.3, 41.3, 54.7, 45.3, 13.3, 9.7, -2, 29.1, 29.5, 31.3, 29.2, 13.8, 26.7, 7.4, 36.8, 42.6, 54.7, 51.3, 42.6, 18, 34, 44.1, 53.6, 44.7, 28.9, 64.9, 60, 66.6, 32.9, 15.5, 37.6, 8.3, 28.5, 16.2, 39.7, 25.9, 8.8, 30.9, 9.9, 39.3, 66.4, 62.4, 53.8, 9.3, 44.7, 50.4, 57.8, 29, 50.1, 28.5, 62.9, 16.3, 54, 45.4, 60.6, 9, 7.7, 64.2, 54.4, 53.3, 45.5, 38, 5.2, 61.7, 10.8, 4.3, 24.8, 26.5, 32.2, 4.5, 49.3, 3.9, 39.6, 26.8, 36.3, 65.1, 59.6, 61.3, 30.1, 65.5, 55.8, 48.2, 49.8, 11.2, 64.2, 29, 44.6, 59.9, 12.6, 51.8, 14.5, 28.8, 49.8, 30.4, 42.7, 2.8, 31.1, 29.2, 27.4, 49.9, 28.2, 59.5, 28.7, 9.4, 30.2, 33.3, 30, 26, 65.1, 55.9, 30.5, 61.1, 50.3, 31.3, 58.2, 41.3, 33.4, 14.8, 51.2, 40.8, 34.1, 33.7, 29.4, 56, 26.4, 30.7, 55.1, 49.7, 37.7, 56.9, 38.5, 28.8, 50.3, 45.7, 13.2, 32.8, 30.5, 30.6, 61.5, 57.7, 33.6, 24.6, 53.9, 36.1, 37.4, 55.5, 27.4, 44.2, 15.4, 56.3, 28.1, 28.8, 67.6, 17.7, 48.5, 57.5, 33.7, 12.9, 19.5, 30.6, 56.8, 75.4, 26, 32.3, 28.3, 10.7, 9, 66.5, 51.6, 30.2, 46, 
44.1, 53, 33.9, 28.4, 53.1, 42.3, 55.2, 42.4, 9.4, 36.3, 26.6, 41.2, 33, 42.1, 27, 25.4, 53.8, 56.7, 22.2, 29.5, 30.9, 9.3, 30.4, 48.1, 30.9, 28.4, 38.6, 28.8, 52, 16.5, 64.3, 56.1, 51.4, 50.2, 30.1, 67.3, 62.3, 12.9, 27.9, 38.9, 29.3, 17.4, 30, 62.5, 40.5, 48, 31.9, 54.7, 27.4, 28.2, 46.6, 14, 61.9, 59.4, 65.4, 30.2, 28.9, 35.4, 55.8, 51.4, 47.8, 34, 56.2, 26.5, 30.2, 8.4, 10.9, 63.9, 41.9, 31.3, 52.8, 36, 45.4, -2, 57.3, 80.3, 41, 13.8, 31.9, 33.8, 48.5, 16.7, 29.5, 6.7, 42.1, 32.2, 45.7, 18.9, 30.5, 30.9, 40.2, 14.6, 41.2, 27, 6.1, 34.9, 57.5, 30.1, 56.6, 62.4, 11.5, 25.7, 14.8, 28.2, 43.5, 37.7, 32.1, 44.4, 56.2, 7.6, 29.4, 63.4, 53, 14.6, 50.1, 62.6, 29.3, 33.5, 52.7)
mix_assign <- c(1, 2, 1, 1, 1, 2, 2, 2, 3, 3, 2, 1, 2, 2, 3, 2, 2, 1, 3, 3, 3, 1, 2, 1, 2, 2, 3, 1, 2, 3, 2, 3, 2, 1, 2, 2, 2, 2, 3, 3, 1, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 3, 2, 1, 2, 3, 3, 2, 1, 2, 2, 3, 2, 1, 1, 2, 2, 3, 2, 2, 1, 2, 3, 2, 2, 3, 3, 2, 2, 2, 3, 3, 1, 2, 3, 1, 2, 2, 1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 3, 2, 2, 2, 3, 2, 1, 1, 2, 2, 1, 2, 3, 3, 2, 3, 1, 3, 3, 2, 2, 1, 3, 3, 1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 2, 2, 2, 3, 1, 3, 3, 2, 2, 3, 1, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 1, 3, 3, 3, 3, 2, 3, 2, 2, 2, 1, 2, 2, 1, 1, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 1, 2, 3, 2, 1, 3, 3, 2, 1, 1, 2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 2, 2, 3, 2, 1, 2, 1, 2, 2, 2, 3, 3, 3, 2, 1, 3, 3, 2, 2, 1, 3, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 3, 3, 2, 3, 2, 1, 2, 2, 1, 2, 1, 3, 1, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 3, 2, 1, 1, 2, 1, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 3, 2, 2, 1, 2, 1, 3, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 1, 3, 3, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 3, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 1, 2, 2, 2, 1, 1, 1, 3, 3, 2, 3, 2, 3, 2, 2, 1, 3, 2, 2, 3, 2, 3, 2, 3, 1, 3, 1, 3, 2, 2, 3, 1, 3, 2, 3, 2, 2, 1, 1, 2, 2, 1, 3, 3, 1, 2, 3, 2, 3, 2, 3, 1, 2, 2, 2, 2, 3, 1, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 1, 3, 3, 2, 2, 2, 1, 2, 2, 2, 1, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 3, 2, 1, 2, 2, 3, 2, 2, 3, 1, 2, 2, 2, 2, 3, 2, 1, 3, 3, 2, 2, 2, 2, 2, 3, 2, 3, 1, 3, 1, 3, 3, 2, 2, 1, 1, 2, 2, 2, 3, 2, 1, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)
mix_assign <- c(mix_assign, 3, 1, 2, 3, 2, 3, 3, 3, 1, 2, 3, 2, 2, 2, 1, 2, 2, 2, 2, 3, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 1, 2, 1, 2, 2, 2, 1, 3, 2, 2, 2, 3, 2, 3, 2, 1, 3, 2, 3, 2, 1, 2, 1, 2, 2, 3, 2, 1, 2, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 2, 1, 2, 2, 3, 1, 1, 2, 2, 3, 3, 3, 1, 1, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 2, 2, 1, 2, 3, 2, 2, 2, 1, 1, 3, 2, 2, 2, 3, 3, 3, 2, 2, 3, 2, 2, 3, 3, 2, 2, 3, 2, 1, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 3, 1, 2, 2, 2, 2, 2, 1, 3, 2, 2, 2, 3, 2, 2, 2, 3, 1, 2, 1, 3, 1, 2, 3, 1, 3, 1, 3, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 3, 3, 3, 1, 2, 1, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 1, 2, 3, 2, 2, 1, 2, 1, 3, 2, 3, 2, 1, 3, 3, 3, 2, 3, 2, 3, 1, 2, 2, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 1, 2, 2, 3, 3, 3, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 3, 3, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 1, 2, 3, 3, 2, 1, 2, 2, 2, 1, 1, 3, 2, 2, 3, 3, 3, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 3, 3, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 1, 3, 2, 3, 3, 2, 3, 2, 1, 2, 2, 2, 2, 3, 2, 2, 1, 3, 2, 2, 1, 3, 2, 2, 2, 3, 2, 3, 3, 2, 1, 2, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 1, 3, 2, 3, 2, 2, 1, 3, 1, 3, 2, 2, 3, 2, 2, 1, 3, 2, 2, 2, 2, 2, 3, 3, 2)

mix_example <- data.frame(x=xExample, assignment=mix_assign)
str(mix_example)
## 'data.frame':    1000 obs. of  2 variables:
##  $ x         : num  7.3 58.7 9.7 16.9 6.3 35.1 33.5 61.3 28.3 24.3 ...
##  $ assignment: num  1 2 1 1 1 2 2 2 3 3 ...
library(flexmix)
## Loading required package: lattice
set.seed(1515)

# If wanting verbose output:
# control = list(tolerance = 1e-15, verbose = 1, iter.max = 1e4)
fit_mix_example <- flexmix(x ~ 1, data = mix_example, k = 3, model = FLXMCnorm1(), 
                           control = list(tolerance = 1e-15, iter.max = 1e4)
                           )

proportions <- prior(fit_mix_example)
comp_1 <- parameters(fit_mix_example, component = 1)
comp_2 <- parameters(fit_mix_example, component = 2)
comp_3 <- parameters(fit_mix_example, component = 3)


fun_prop <- function(x, mean, sd, proportion){
    proportion * dnorm(x = x, mean = mean, sd = sd)
}

ggplot(mix_example) + 
    geom_histogram(aes(x = x, y = ..density..)) + 
    stat_function(geom = "line", fun = fun_prop, 
                  args = list(mean = comp_1[1], sd = comp_1[2], proportion = proportions[1])
                  ) +
    stat_function(geom = "line", fun = fun_prop, 
                  args = list(mean = comp_2[1], sd = comp_2[2], proportion = proportions[2])
                  ) +
    stat_function(geom = "line", fun = fun_prop, 
                  args = list(mean = comp_3[1], sd = comp_3[2], proportion = proportions[3])
                  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Explore the first assignments
head(clusters(fit_mix_example))
## [1] 2 1 2 2 2 3
# Explore the first real labels
head(mix_example$assignment)
## [1] 1 2 1 1 1 2
# Create frequency table
table(mix_example$assignment, clusters(fit_mix_example))
##    
##       1   2   3
##   1   0 184   1
##   2 464   5  37
##   3  18   2 289
genderData <- readr::read_csv("./RInputFiles/gender.csv")
## Parsed with column specification:
## cols(
##   Gender = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   BMI = col_double(),
##   probability = col_double()
## )
str(genderData)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of  5 variables:
##  $ Gender     : chr  "Male" "Male" "Male" "Male" ...
##  $ Height     : num  73.8 68.8 74.1 71.7 69.9 ...
##  $ Weight     : num  242 162 213 220 206 ...
##  $ BMI        : num  0.0444 0.0343 0.0387 0.0428 0.0423 ...
##  $ probability: num  5.78e-06 6.06e-01 2.63e-05 3.63e-04 4.61e-03 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   BMI = col_double(),
##   ..   probability = col_double()
##   .. )
set.seed(1313)
fit_with_covariance <- flexmix(cbind(Weight, BMI) ~ 1, data = genderData, k = 2, 
                               model = FLXMCmvnorm(diag = FALSE), 
                               control = list(tolerance = 1e-15, iter.max = 1000)
                               )

# Get the parameters
comp_1 <- parameters(fit_with_covariance, component = 1)
comp_2 <- parameters(fit_with_covariance, component = 2)

# The means
mean_comp_1 <- comp_1[1:2]
mean_comp_1
## [1] 135.97738684   0.03334097
mean_comp_2 <- comp_2[1:2]
mean_comp_2
## [1] 186.6849545   0.0391035
# The covariance matrices
covariance_comp_1 <- matrix(comp_1[3:6], nrow = 2)
covariance_comp_1
##              [,1]         [,2]
## [1,] 370.85097459 4.712215e-02
## [2,]   0.04712215 8.103393e-06
covariance_comp_2 <- matrix(comp_2[3:6], nrow = 2)
covariance_comp_2
##              [,1]         [,2]
## [1,] 405.22840544 2.742036e-02
## [2,]   0.02742036 4.668417e-06
# Create ellipse curve 1
ellipse_comp_1 <- ellipse::ellipse(x = covariance_comp_1, centre = mean_comp_1, npoints = nrow(genderData))
head(ellipse_comp_1)
##             x          y
## [1,] 181.4301 0.04005980
## [2,] 181.4223 0.04006096
## [3,] 181.4144 0.04006212
## [4,] 181.4065 0.04006327
## [5,] 181.3986 0.04006442
## [6,] 181.3906 0.04006557
# Create ellipse curve 2
ellipse_comp_2 <- ellipse::ellipse(x = covariance_comp_2, centre = mean_comp_2, npoints = nrow(genderData))
head(ellipse_comp_2)
##             x          y
## [1,] 231.1740 0.04387866
## [2,] 231.1607 0.04388009
## [3,] 231.1473 0.04388151
## [4,] 231.1340 0.04388294
## [5,] 231.1206 0.04388436
## [6,] 231.1072 0.04388578
# Plot the ellipses
genderData %>% 
    ggplot(aes(x = Weight, y = BMI)) + geom_point()+
    geom_path(data = data.frame(ellipse_comp_1), aes(x=x,y=y), col = "red") +
    geom_path(data = data.frame(ellipse_comp_2), aes(x=x,y=y), col = "blue")

# Check the assignments
table(genderData$Gender, clusters(fit_with_covariance))
##         
##             1    2
##   Female 4540  460
##   Male    386 4614

Chapter 4 - Mixture Models Beyond Gaussians

Bernoulli Mixture Models:

  • Example of the handwritten images dataset, with images of 3 and 6
    • The Bernoulli distribution is appropriate for a binary outcome when you do not want a continuous prediction but rather a probability of success
    • p <- 0.7
    • bernoulli <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p, p))
  • Every pixel in an image can be represented by a Bernoulli with outcomes being either black (filled) or white (empty), with each 16x16 image flattened into a 256-length vector; example for a 3-pixel image
    • p1 <- 0.7; p2 <- 0.5; p3 <- 0.4
    • bernoulli_1 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p1, p1))
    • bernoulli_2 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p2, p2))
    • bernoulli_3 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p3, p3))
    • multi_bernoulli <- cbind(bernoulli_1, bernoulli_2, bernoulli_3)
    • p_vector <- c(p1, p2, p3)
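
The multivariate Bernoulli sketch above can be sanity-checked in a few lines: with enough draws, the column means of the simulated matrix should approximate the true probabilities (a minimal sketch using only base R; variable names follow the bullets above):

```r
set.seed(42)
p_vector <- c(0.7, 0.5, 0.4)

# Draw 1,000 observations of the 3-pixel image, one column per pixel
multi_bernoulli <- sapply(p_vector, function(p) {
    sample(c(0, 1), 1000, replace = TRUE, prob = c(1 - p, p))
})

# The sample means should land close to p_vector
round(colMeans(multi_bernoulli), 2)
```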

Bernoulli Mixture Models with flexmix:

  • Example of using flexmix to predict whether a digit is a 3 or a 6 in the 16x16 data that is stored in digits
    • digits_sample <- as.matrix(digits) # 320 x 256 (flattened array of pixels per image)
    • show_digit(digits_sample[320,])
  • For fitting a Bernoulli mixture model with two segments, can proceed similarly to before
    • bernoulli_mix_model <- flexmix(digits_sample~1, k=2, model=FLXMCmvbinary(), control = list(tolerance = 1e-15, iter.max = 1000)) # digits_sample is the matrix, with each row as an image
    • prior(bernoulli_mix_model) # proportions in each cluster
    • param_comp1 <- parameters(bernoulli_mix_model, component = 1)
    • param_comp2 <- parameters(bernoulli_mix_model, component = 2)
    • show_digit(param_comp1) # will show probabilities of pixelation given segment is a 3

Poisson Mixture Models:

  • Can use Chicago crimes data with a Poisson model, since the goal is a count of crimes by type by community, with segments by level of crime danger in the community
    • lambda_1 <- 100; lambda_2 <- 200; lambda_3 <- 300
    • poisson_1 <- rpois(n = 100, lambda = lambda_1)
    • poisson_2 <- rpois(n = 100, lambda = lambda_2)
    • poisson_3 <- rpois(n = 100, lambda = lambda_3)
    • multi_poisson <- cbind(poisson_1, poisson_2, poisson_3)
  • Can extend the general concept to a crime dataset with 13 columns
    • Multi-Poisson distribution
    • Try from 1-15 clusters, and minimize the BIC criterion
    • Parameters include each lambda for each multi-Poisson distribution
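
Since cluster selection above relies on minimizing BIC, it may help to see what the criterion computes; a minimal sketch with made-up numbers (the log-likelihood, parameter count, and observation count here are illustrative assumptions, not values from a real fit):

```r
# BIC = -2 * log-likelihood + (number of free parameters) * log(number of observations)
loglik <- -1234.5  # hypothetical log-likelihood of a fitted mixture
n_par  <- 41       # hypothetical count of free parameters (lambdas + mixing proportions)
n_obs  <- 77       # e.g. one row per community
bic    <- -2 * loglik + n_par * log(n_obs)
bic
```

Larger `n_par` inflates the penalty term, which is why adding clusters eventually stops improving BIC even as the log-likelihood keeps rising.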

Poisson Mixture Models with flexmix:

  • Example of solving the Poisson mixture model using flexmix, with indeterminate clusters (try 1-15 and minimize BIC), and with parameters of lambdas by cluster and proportions by cluster
    • crimes_matrix <- as.matrix(crimes[,-1]) # do not include the community names
    • poisson_mix_model <- stepFlexmix(crimes_matrix ~ 1, k = 1:15, nrep = 5, model = FLXMCmvpois(), control = list(tolerance = 1e-15, iter.max = 1000)) # stepFlexmix is how to run k = 1:15
    • best_fit <- getModel(poisson_mix_model, which = "BIC") # can also use AIC or ICL
    • prior(best_fit) # proportions
    • param_pmm <- data.frame(parameters(best_fit)) # parameters by cluster, converted to data frame
    • param_pmm <- param_pmm %>% mutate(Type = colnames(crimes_matrix))
    • head(param_pmm)
    • param_pmm %>% gather(Components, Lambda, -Type) %>% ggplot(aes(x = Type, y = Lambda, fill = Type)) + geom_bar(stat = "identity") + facet_wrap(~ Components) + theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")
    • crimes_c <- crimes %>% mutate(CLUSTER = factor(clusters(best_fit)))
    • crimes_c %>% group_by(CLUSTER) %>% mutate(NUMBER = row_number()) %>% ggplot(aes(x = CLUSTER, y = NUMBER, col = CLUSTER)) + geom_text(aes(label = COMMUNITY), size = 2.3) + theme(legend.position = "none")

Example code includes:

# Create the vector of probabilities
p_cluster_1 <- c(0.8, 0.8, 0.2, 0.9)

# Create the sample for each pixel
set.seed(18102308)
pixel_1 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p_cluster_1[1], p_cluster_1[1]))
pixel_2 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p_cluster_1[2], p_cluster_1[2]))
pixel_3 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p_cluster_1[3], p_cluster_1[3]))
pixel_4 <- sample(c(0, 1), 100, replace = TRUE, prob = c(1-p_cluster_1[4], p_cluster_1[4]))

# Combine the samples
sample_cluster_1 <- cbind(pixel_1, pixel_2, pixel_3, pixel_4)

# Have a look at the sample
head(sample_cluster_1)
##      pixel_1 pixel_2 pixel_3 pixel_4
## [1,]       1       1       0       1
## [2,]       1       1       0       1
## [3,]       1       1       0       1
## [4,]       1       1       0       1
## [5,]       0       1       0       1
## [6,]       1       1       0       1
digitUse2 <- rowSums(digitKey[, c(1, 3, 10)]) == 1
digits_sample_2 <- digitData[digitUse2, ]
dim(digits_sample_2)
## [1] 478 256
# transform into matrix
digits_sample_2 <- as.matrix(digits_sample_2)

# dimension
dim(digits_sample_2)
## [1] 478 256
# look at the first observation
show_digit(digits_sample_2[1, ])

# look at the last observation
show_digit(digits_sample_2[nrow(digits_sample_2), ])

set.seed(1513)
# Fit Bernoulli mixture model
bernoulli_mix_model <- flexmix(digits_sample_2 ~ 1, k = 3, model = FLXMCmvbinary(), 
                               control = list(tolerance = 1e-15, iter.max = 1000)
                               )
prior(bernoulli_mix_model)
## [1] 0.3117220 0.3353131 0.3529649
# Extract the parameters for each cluster
param_comp_1 <- parameters(bernoulli_mix_model, component = 1)
param_comp_2 <- parameters(bernoulli_mix_model, component = 2)
param_comp_3 <- parameters(bernoulli_mix_model, component = 3)

# Visualize the clusters
show_digit(param_comp_1)

show_digit(param_comp_2)

show_digit(param_comp_3)

set.seed(1541)

# Create the vector of lambdas
lambda_1 <- c(150, 300, 50)

# Create the sample of each crime
assault_1 <- rpois(n = 10, lambda = lambda_1[1])
robbery_1 <- rpois(n = 10, lambda = lambda_1[2])
battery_1 <- rpois(n = 10, lambda = lambda_1[3])

# Combine the results
cities_1 <- cbind(assault_1, robbery_1, battery_1)

# Check the sample
cities_1
##       assault_1 robbery_1 battery_1
##  [1,]       154       297        55
##  [2,]       142       276        50
##  [3,]       166       312        41
##  [4,]       158       273        56
##  [5,]       129       278        52
##  [6,]       150       307        39
##  [7,]       140       321        49
##  [8,]       152       321        53
##  [9,]       143       318        42
## [10,]       125       324        51
crimes <- readr::read_csv("./RInputFiles/CoC_crimes.csv")
## Parsed with column specification:
## cols(
##   COMMUNITY = col_character(),
##   ASSAULT = col_double(),
##   BATTERY = col_double(),
##   BURGLARY = col_double(),
##   `CRIMINAL DAMAGE` = col_double(),
##   `CRIMINAL TRESPASS` = col_double(),
##   `DECEPTIVE PRACTICE` = col_double(),
##   `MOTOR VEHICLE THEFT` = col_double(),
##   NARCOTICS = col_double(),
##   OTHER = col_double(),
##   `OTHER OFFENSE` = col_double(),
##   ROBBERY = col_double(),
##   THEFT = col_double()
## )
dim(crimes)
## [1] 77 13
names(crimes) <- stringr::str_replace_all(stringr::str_to_lower(names(crimes)), " ", ".")

# Check with glimpse
glimpse(crimes)
## Observations: 77
## Variables: 13
## $ community           <chr> "ALBANY PARK", "ARCHER HEIGHTS", "ARMOUR S...
## $ assault             <dbl> 123, 51, 74, 169, 708, 1198, 118, 135, 337...
## $ battery             <dbl> 429, 134, 184, 448, 1681, 3347, 280, 350, ...
## $ burglary            <dbl> 147, 92, 55, 194, 339, 517, 76, 145, 327, ...
## $ criminal.damage     <dbl> 287, 114, 99, 379, 859, 1666, 150, 310, 52...
## $ criminal.trespass   <dbl> 38, 23, 56, 43, 228, 265, 29, 36, 88, 29, ...
## $ deceptive.practice  <dbl> 137, 67, 59, 178, 310, 767, 73, 200, 314, ...
## $ motor.vehicle.theft <dbl> 176, 50, 37, 189, 281, 732, 58, 123, 411, ...
## $ narcotics           <dbl> 27, 18, 9, 30, 345, 1456, 15, 22, 119, 10,...
## $ other               <dbl> 107, 37, 48, 114, 584, 1261, 76, 88, 238, ...
## $ other.offense       <dbl> 158, 44, 35, 164, 590, 1130, 94, 142, 339,...
## $ robbery             <dbl> 144, 30, 98, 111, 349, 829, 65, 109, 172, ...
## $ theft               <dbl> 690, 180, 263, 461, 1201, 2137, 239, 669, ...
# Transform into a matrix, without `community`
matrix_crimes <- crimes %>%
  select(-community) %>%  
  as.matrix()

# Check the first values
head(matrix_crimes)
##      assault battery burglary criminal.damage criminal.trespass
## [1,]     123     429      147             287                38
## [2,]      51     134       92             114                23
## [3,]      74     184       55              99                56
## [4,]     169     448      194             379                43
## [5,]     708    1681      339             859               228
## [6,]    1198    3347      517            1666               265
##      deceptive.practice motor.vehicle.theft narcotics other other.offense
## [1,]                137                 176        27   107           158
## [2,]                 67                  50        18    37            44
## [3,]                 59                  37         9    48            35
## [4,]                178                 189        30   114           164
## [5,]                310                 281       345   584           590
## [6,]                767                 732      1456  1261          1130
##      robbery theft
## [1,]     144   690
## [2,]      30   180
## [3,]      98   263
## [4,]     111   461
## [5,]     349  1201
## [6,]     829  2137
set.seed(2017)
# Fit the Poisson mixture model
poisson_mm <- stepFlexmix(matrix_crimes ~ 1, k = 1:15, nrep = 5, model = FLXMCmvpois(), 
                          control = list(tolerance = 1e-15, iter.max = 1000)
                          )
## 1 : * * * * *
## 2 : * * * * *
## 3 : * * * * *
## 4 : * * * * *
## 5 : * * * * *
## 6 : * * * * *
## 7 : * * * * *
## 8 : * * * * *
## 9 : * * * * *
## 10 : * * * * *
## 11 : * * * * *
## 12 : * * * * *
## 13 : * * * * *
## 14 : * * * * *
## 15 : * * * * *
# Select the model that minimizes the BIC
best_poisson_mm <- getModel(poisson_mm, which = "BIC")

# Get the parameters into a data frame
params_lambdas <- data.frame(parameters(best_poisson_mm))

# Add the column with the type of crime
params_lambdas_crime <- params_lambdas %>% 
    mutate(crime = colnames(matrix_crimes))

# Plot the clusters with their lambdas
params_lambdas_crime %>% 
    gather(cluster, lambdas, -crime) %>% 
    ggplot(aes(x = crime, y = lambdas, fill = crime)) + 
    geom_bar(stat = "identity") +
    facet_wrap(~ cluster) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")

# Add the cluster assignments
crimes_with_clusters <- crimes %>% 
    mutate(cluster = factor(clusters(best_poisson_mm)))

# Plot the clusters with the communities
crimes_with_clusters %>% 
    group_by(cluster) %>% 
    mutate(number = row_number()) %>% 
    ggplot(aes(x = cluster, y = number, col = cluster)) + 
    geom_text(aes(label = community), size = 2.3) +
    theme(legend.position="none")


Developing R Packages

Chapter 1 - The R Package Structure

Introduction to Package Building:

  • R packages can include functions, data, documentation, vignettes, tests, and the like
  • At a minimum, the package must have 1) an R directory, 2) a man directory, 3) a NAMESPACE file, and 4) a DESCRIPTION file
  • Can use devtools to help with package building, as well as roxygen2
    • create()
    • document()
    • check()
    • build()
    • test()
  • The devtools functions have descriptive names closely related to their assigned tasks
    • devtools::create("simutils") # avoid names that are already on CRAN

Description and Namespace Files:

  • The DESCRIPTION contains background information such as author, package name, version, license, and the like
  • The NAMESPACE file will be edited based on entries made in other locations
    • import() directives bring in functions from other packages
    • export() directives make the package's own functions visible to the end-user
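
For reference, a roxygen2-generated NAMESPACE file is short; a sketch consistent with the import/export tags used in the datasummary examples later in this document might read:

```
# Generated by roxygen2: do not edit by hand

export(data_summary)
import(dplyr)
import(purrr)
importFrom(tidyr,gather)
```

This file should not be edited by hand; it is regenerated from the roxygen tags each time document() runs.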

Optional Directories:

  • The most common optional directories include data, vignettes, tests, compiled code, and translations
  • Example for adding data to the package
    • sim_dat <- data.frame( ID = 1:10, Value = sample(1:11, 10), Apples = sample(c(TRUE, FALSE), 10, replace = TRUE) )
    • devtools::use_data(sim_dat, pkg = "simutils")
  • Example for adding vignettes to the package
    • use_vignette("my_first_vignette", pkg = "simutils")
  • Best practices for structuring code - the R directory may NOT include subdirectories
    • Group similar functions together in a file
    • Do not create a small file for every function

Example code includes:

# Use the create function to set up your first package
devtools::create("./RPackages/datasummary")

# Take a look at the files and folders in your package
dir("./RPackages/datasummary")


# Create numeric_summary() function
numeric_summary <- function(x, na.rm) {
    # Include an error if x is not numeric
    if(!is.numeric(x)){
        stop("Data must be numeric")
    }
    
    # Create data frame
    data.frame( min = min(x, na.rm = na.rm),
                median = median(x, na.rm = na.rm),
                sd = sd(x, na.rm = na.rm),
                max = max(x, na.rm = na.rm))
}

data(airquality)

# Test numeric_summary() function
numeric_summary(airquality$Ozone, TRUE)


# What is in the R directory before adding a function?
dir("./RPackages/datasummary/R")

# Use the dump() function to write the numeric_summary function
dump("numeric_summary", file = "./RPackages/datasummary/R/numeric_summary.R")

# Verify that the file is in the correct directory
dir("./RPackages/datasummary/R")


# A package should not have the same name as an existing package, and its name must only contain letters, numbers, or dots.


# What is in the package at the moment?
dir("./RPackages/datasummary")

# Add the weather data
data(Weather, package="mosaicData")
devtools::use_data(Weather, pkg = "./RPackages/datasummary")

# Add a vignette called "Generating Summaries with Data Summary"
devtools::use_vignette("Generating_Summaries_with_Data_Summary", pkg = "./RPackages/datasummary")

# What directories do you now have in your package now?
dir("./RPackages/datasummary")


data_summary <- function(x, na.rm = TRUE){
  num_data <- select_if(x, .predicate = is.numeric) 
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
}

# Write the function to the R directory
dump("data_summary", file = "./RPackages/datasummary/R/data_summary.R")
dir("./RPackages/datasummary")

Chapter 2 - Documenting Packages

Introduction to roxygen2:

  • Good documentation is key to the package - all functions, usages, outputs, etc.
  • The roxygen2 package allows for creating this type of documentation
    • Included above the function, with each line starting with #'
    • Paragraph 1 is the title (should be short)
    • Paragraph 2 is the brief description (should be a single sentence)
    • Paragraph 3 is the longer description (an actual paragraph of a few sentences)
    • Paragraphs 4+ need to all start with an @tag, where the tag might be param, author, import, return, export, examples, etc.
  • Packages need to import other packages rather than calling them by way of library(package)

How to export functions:

  • Exported functions become visible to the end-user and are core to package functionality (documented)
    • Flagged with a tag of #' @export
  • Non-exported functions are not visible to the end-user but instead serve as utility functions within the package
    • Flagged by having no #' @export tag
  • Only the exported functions are visible to the user when the package is loaded, so calling a non-exported function requires the triple-colon operator with the package namespace
    • simutils:::sum_na(airquality$Ozone)

Documenting other elements:

  • Can include examples using the @examples tag with the example included on the line below
    • If you do not want an example to be run during checking, wrap it in \dontrun{} (the wrapped example may span multiple lines)
  • The return value from the function is documented using the @return tag, just as the @param tag documents each of the arguments/parameters
  • Can use the @author and @seealso tags as needed

Documenting a package:

  • All roxygen headers need to be followed by some form of R code
    • For package-level documentation, the single string "_PACKAGE" can serve as that code
  • The minimum level of documentation is title, description, arguments, and export (for exported functions only)
    • Other components may be useful to the end-user
  • Can put a data frame from the current environment into the package
    • use_data(sim_dat, pkg = "simutils")
  • Can also use the roxygen formatting and then conclude with "sim_dat" to put this in with headers and descriptors
  • Can create the manual files (create and/or update) using
    • document("simutils")

Example code includes:

#' Summary of Numeric Columns
#'
#' Generate specific summaries of numeric columns in a data frame
#' 
#' @param x A data frame. Non-numeric columns will be removed
#' @param na.rm A logical indicating whether missing values should be removed
#' @import purrr
#' @import dplyr
#' @importFrom tidyr gather
data_summary <- function(x, na.rm = TRUE){
  
  num_data <- select_if(x, .predicate = is.numeric) 
  
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
  
}


#' Summary of Numeric Columns
#'
#' Generate specific summaries of numeric columns in a data frame
#' 
#' @param x A data frame. Non-numeric columns will be removed
#' @param na.rm A logical indicating whether missing values should be removed
#' @import dplyr
#' @import purrr
#' @importFrom tidyr gather
#' @export
data_summary <- function(x, na.rm = TRUE){
  
  num_data <- select_if(x, .predicate = is.numeric) 
  
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
  
}


#' Data Summary for Numeric Columns
#'
#' Custom summaries of numeric data in a provided data frame
#'
#' @param x A data.frame containing at least one numeric column
#' @param na.rm A logical indicating whether missing values should be removed
#' @import dplyr
#' @import purrr
#' @importFrom tidyr gather
#' @export
#' @examples
#' data_summary(iris)
#' data_summary(airquality, na.rm = FALSE)
data_summary <- function(x, na.rm = TRUE){
  
  num_data <- select_if(x, .predicate = is.numeric) 
  
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
  
}



# For code you use \code{text to format}
# To link to other functions you use \link[packageName]{functionName}, although note the package name is only required if the function is not in your package
# To include an unordered list you use \itemize{}. Inside the brackets you mark new items with \item followed by the item text.

#' Data Summary for Numeric Columns
#'
#' Custom summaries of numeric data in a provided data frame
#'
#' @param x A data.frame containing at least one numeric column
#' @param na.rm A logical indicating whether missing values should be removed
#' @import dplyr
#' @import purrr
#' @importFrom tidyr gather
#' @export
#' @examples
#' data_summary(iris)
#' data_summary(airquality, na.rm = FALSE)
#'
## Update the details for the return value
#' @return 
#' This function returns a \code{data.frame} including columns:
#' \itemize{
#'  \item ID
#'  \item min
#'  \item median
#'  \item sd
#'  \item max
#' }
#'
#' @export
data_summary <- function(x, na.rm = TRUE){
  
  num_data <- select_if(x, .predicate = is.numeric) 
  
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
  
}


#' Summary of Numeric Columns
#'
#' Generate specific summaries of numeric columns in a data frame
#'
#' @param x A data frame. Non-numeric columns will be removed
#' @param na.rm A logical indicating whether missing values should be removed
#' @import dplyr
#' @import purrr
#' @importFrom tidyr gather
#' @export
#' @examples
#' data_summary(iris)
#' data_summary(airquality, na.rm = FALSE)
#' 
#' @return This function returns a \code{data.frame} including columns: 
#' \itemize{
#'  \item ID
#'  \item min
#'  \item median
#'  \item sd
#'  \item max
#' }
#'
## Add in the author of the `data_summary()` function. 
#' @author My Name <myemail@example.com>
## Update the header to link to the `summary()` function (in the `base` package).
#' @seealso \link[base]{summary}
data_summary <- function(x, na.rm = TRUE){
  
  num_data <- select_if(x, .predicate = is.numeric) 
  
  map_df(num_data, .f = numeric_summary, na.rm = na.rm, .id = "ID")
  
}


#' Custom Data Summaries
#' 
#' Easily generate custom data frame summaries
#' 
#' @docType package
#' @name datasummary
"_PACKAGE"


#' Random Weather Data
#'
#' A dataset containing randomly generated weather data.
#'
#' @format A data frame of 7 rows and 3 columns
#' \describe{
#'  \item{Day}{Numeric values giving day of the week, 1 = Monday, 7 = Sunday}
#'  \item{Temp}{Integer values giving temperature in degrees Celsius}
#'  \item{Weather}{Character values giving precipitation type, Sun if none}
#' }
#' @source Randomly generated data
weather


# Generate package documentation
document("datasummary")

# Examine the contents of the man directory
dir("datasummary/man")

# View the documentation for the data_summary function
help("data_summary")

# View the documentation for the weather dataset
help("weather")

Chapter 3 - Checking and Building R Packages

Why check an R package?

  • There are many checks for a typical R package, including many that are mandatory for submission to CRAN
    • Package can be installed
    • Description information is correct
    • Dependencies make sense
    • No code syntax errors
    • Documentation is complete
    • Unit tests can be run
    • Vignettes can be built
  • The system-level R CMD check tool, wrapped by devtools::check(), can check the package for you
    • check("simutils") # default settings are the same as for CRAN
    • Errors need to be fixed, while warnings are less problematic (still should be investigated)

Errors, warnings, and notes:

  • Package dependencies will be checked; will error out if these are not available
  • Often have documentation issues; need to update function and argument names to match what has been updated in code
  • The examples are all run, and need to return with an OK status
  • If LaTeX is not installed, there will be documentation build errors

Differences in package dependencies:

  • The dependencies from "Depends:" are loaded into the search() path, though this is not always the recommended approach due to masking
  • The dependencies from "Imports:" are loaded by a namespace
    • use_package("dplyr") ## adds to Imports
  • The "Suggests:" are not required for running the package but may be helpful (for example, for running the vignettes)
    • use_package("ggplot2", "suggests") ## adds to Suggests
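
The corresponding fields in the DESCRIPTION file end up looking like the sketch below (the version number and the Suggests entry are illustrative assumptions; the Imports entries match the datasummary examples in this document):

```
Package: datasummary
Title: Custom Data Summaries
Version: 0.1.0
Imports:
    dplyr,
    purrr,
    tidyr
Suggests:
    ggplot2
```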

Building packages with continuous integration:

  • Can create package as either a source file or as a binary
    • build("simutils") # default builds the source file
    • build("simutils", binary = TRUE) # builds the binary version, needed for compiled code
  • Continuous integration can help with package maintenance
    • Automatically checks packages whenever code is changed
    • Useful with version control
    • use_travis("simutils")

Example code includes:

# Check your package
check("datasummary")


#' Numeric Summaries
#'
#' Summarises numeric data and returns a data frame containing the minimum value, median, standard deviation, and maximum value.
#'
#' @param x a numeric vector containing the values to summarize.
#' @param na.rm a logical value indicating whether NA values should be stripped before the computation proceeds.
numeric_summary <- function(x, na.rm){

  if(!is.numeric(x)){
    stop("data must be numeric")
  }

  data.frame( min = min(x, na.rm = na.rm),
              median = median(x, na.rm = na.rm),
              sd = sd(x, na.rm = na.rm),
              max = max(x, na.rm = na.rm))
}


# The way in which you define variables in tidyverse package functions can cause confusion for the R CMD check, which sees column names and the name of your dataset, and flags them as "undefined global variables".
# To get around this, you can manually specify the data and its columns as a vector to utils::globalVariables(), by including a line of code similar to the following in your package-level documentation:
# utils::globalVariables(c("dataset_name", "col_name_1", "col_name_2"))
# This defines dataset_name, col_name_1, and col_name_2 as global variables, and now you shouldn't get the undefined global variables error.

#' datasummary: Custom Data Summaries
#'
#' Easily generate custom data frame summaries
#'
#' @docType package
#' @name datasummary
"_PACKAGE"

# Update this function call
utils::globalVariables(c("weather", "Temp"))


# Add dplyr as an imported dependency to the DESCRIPTION file
use_package("dplyr", pkg = "datasummary")

# Add purrr as an imported dependency to the DESCRIPTION file
use_package("purrr", pkg = "datasummary")

# Add tidyr as an imported dependency to the DESCRIPTION file
use_package("tidyr", pkg = "datasummary")


# Build the package
build("datasummary")

# Examine the contents of the current directory
dir("datasummary")

Chapter 4 - Adding Unit Tests to R Packages

What are unit tests and why write them?

  • Unit tests can help with checking that future function behavior remains as expected
    • Changes in functionality of supporting code
    • Later versions of R
    • Different operating systems
    • Different underlying data
  • Can add unit tests to the package using roxygen2
    • Call use_testthat() to set up the test framework
    • This creates a test directory in the package root directory
    • Within the test directory, there is a script testthat.R which contains code to run the tests
    • Within the test directory is a directory testthat where you save all of your test scripts
  • Can then create individual tests using functions that begin with expect_
    • library(testthat)
    • my_vector <- c("First" = 1, "Second" = 2)
    • expect_identical(my_vector, c("First" = 1, "Second" = 2)) # will pass since the values, types, and attributes are all the same
    • expect_identical(my_vector, c(1, 2)) # fails: names for target but not for current
    • expect_equal(my_vector, c("First" = 1L, "Second" = 2L)) # checks the values and attributes but not the type, so this passes even though the expected values are integers
    • expect_equal(my_vector, c(First = 1.1, Second = 2.1), tolerance = 0.1) # include a tolerance for numeric differences
    • expect_equivalent(my_vector, c(1, 2)) # compares only the values, not the types or the attributes
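
A sketch of the directory layout that the use_testthat() setup described above creates, assuming a package named datasummary (as in the example code below); the test_check() call shown is the boilerplate that testthat writes into testthat.R:

```
datasummary/
├── DESCRIPTION
├── NAMESPACE
├── R/
└── tests/
    ├── testthat.R    # boilerplate: library(testthat); test_check("datasummary")
    └── testthat/     # save individual test scripts (e.g. test-data_summary.R) here
```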

Testing errors and warnings:

  • Functions can give warnings or errors for many reasons, such as sqrt(-1)
  • The testthat package contains functions for expecting warnings and errors
    • expect_warning(sqrt(-1)) # passes
    • expect_error(sqrt("foo")) # passes
    • expect_error(sqrt(-1)) # fails, since this returns a warning rather than an error
    • expect_error(sqrt("foo"), "non-numeric argument to mathematical function") # also requires that the error message match the string in the second argument
    • expect_error(sqrt("foo"), "NaNs produced") # fails, since that is not the message

Testing specific output and non-exported functions:

  • Printed messages and plots are side effects, which can be tested using expect_output()
    • expect_output(str(airquality), "41 36 12 18 NA 28 23 19 8 NA") # passes, since this is part of the output
    • expect_output(str(airquality), "air") # fails, since the word "air" is not in the output
    • expect_output_file(str(airquality), "airq.txt", update = TRUE) # creates the file, so the test will fail the first time that it is run
    • expect_output_file(str(airquality), "airq.txt") # on subsequent runs, it compares the output to the saved file
  • Can also test non-exported functions by accessing them with the ::: operator
    • expect_equivalent(na_counter(airquality), c(37, 7, 0, 0, 0, 0))
    • expect_equal(simutils:::sum_na(airquality$Ozone), 37) # three colons signal that this is a non-exported function

Grouping and running tests:

  • Can organize and group tests for easier future use - can add many tests in a single test_that() call
    • test_that("na_counter correctly counts NA values", {
        test_matrix <- matrix(c(NA, 1, 4, NA, 5, 6), nrow = 2)
        air_expected <- c(Ozone = 37, Solar.R = 7, Wind = 0, Temp = 0, Month = 0, Day = 0)
        mat_expected <- c(V1 = 1, V2 = 1, V3 = 0)
        expect_equal(na_counter(airquality), air_expected)
        expect_equal(na_counter(test_matrix), mat_expected)
      })
    • context("na_counter checks") # will then give the appropriate information about what failed in the checks
  • Need to fix any failed tests, either by fixing the code or by fixing the test

Wrap up:

  • Structure of R packages, including NAMESPACE and DESCRIPTION
  • Documenting R packages and including roxygen2 for examples/documentation
  • Building integrated packages with checks and tests
  • Unit tests to ensure that packages run as expected

Example code includes:

# Set up the test framework
use_testthat("datasummary")

# Look at the contents of the package root directory
dir("datasummary")

# Look at the contents of the new folder which has been created 
dir("datasummary/tests")


# Create a summary of the iris dataset using your data_summary() function
iris_summary <- data_summary(iris)

# Count how many rows are returned
summary_rows <- nrow(iris_summary) 

# Use expect_equal to test that calling data_summary() on iris returns 4 rows
expect_equal(summary_rows, 4)


result <- data_summary(weather)

# Update this test so it passes
expect_equal(result$sd, c(2.1, 3.6), tolerance = 0.1)

expected_result <- list(
    ID = c("Day", "Temp"),
    min = c(1L, 14L),
    median = c(4L, 19L),
    sd = c(2.16024689946929, 3.65148371670111),
    max = c(7L, 24L)
)

# Write a passing test that compares expected_result to result
expect_equivalent(result, expected_result)


# Create a vector containing the numbers 1 through 10
my_vector <- 1:10

# Look at what happens when we apply this vector as an argument to data_summary()
data_summary(my_vector)

# Test if running data_summary() on this vector returns an error
expect_error(data_summary(my_vector))


# Run data_summary on the airquality dataset with na.rm set to FALSE
data_summary(airquality, na.rm=FALSE)

# Use expect_warning to formally test this
expect_warning(data_summary(airquality, na.rm = FALSE))


# Expected result
expected <- data.frame(min = 14L, median = 19L, sd = 3.65148371670111, max = 24L)

# Create variable result by calling numeric summary on the temp column of the weather dataset
result <- datasummary:::numeric_summary(weather$Temp, na.rm = TRUE)

# Test that the value returned matches the expected value
expect_equal(result, expected)


# Use context() and test_that() to group the tests below together
context("Test data_summary()")

test_that("data_summary() handles errors correctly", {

  # Create a vector
  my_vector <- 1:10

  # Use expect_error()
  expect_error(data_summary(my_vector))

  # Use expect_warning()
  expect_warning(data_summary(airquality, na.rm = FALSE))

})


# Run the tests on the datasummary package
test("datasummary")

Factor Analysis in R

Chapter 1 - Evaluating Your Measure with Factor Analysis

Introduction to Exploratory Factor Analysis:

  • Psychometrics is the study of unobservable (“of the mind”) variables
  • Factor analysis is a valuable tool in psychometric analysis
  • Factor analysis is mid-way between SEM (structural equation modeling) and Classical Test Theory
    • Exploratory Factor Analysis (EFA) is used during measure development
    • Confirmatory Factor Analysis (CFA) is used to validate a measure after development
  • This course will use library(psych) and the gcbs (Generic Conspiracist Beliefs Scale) dataset on conspiracy beliefs
    • EFA_model <- fa(gcbs)
    • fa.diagram(EFA_model)
    • EFA_model$loadings

Overview of the Measure Development Process:

  • The development process includes
    • Develop items for your measure - start with a larger list than you need
    • Collect pilot data from a representative sample
    • Check what the dataset looks like - psych::describe()
    • Consider whether to run EFA or CFA or both
    • If running both EFA and CFA, split the dataset into two samples
    • Compare the two samples to make sure they are similar
  • Can use the psych package to view a dataset by the grouping variable
    • gcbs_grouped <- cbind(gcbs, group_var)
    • describeBy(gcbs_grouped, group = group_var) # group= can be a vector
    • statsBy(gcbs_grouped, group = "group_var") # group= needs to be the name of a column

Measure Features: Correlations and Reliability:

  • Can grab the lower-diagonal correlations using lowerCor(gcbs)
  • Can also get the p-values using corr.test(gcbs, use = "pairwise.complete.obs")$p # gets the p-values for each correlation
  • Can also get the 95% CI for the correlations using corr.test(gcbs, use = "pairwise.complete.obs")$ci
  • Can also get the alpha using alpha(gcbs) # measure of the internal consistency or 'reliability' of the measure, with a target of 0.8+
  • Can look at the split-half reliability using splitHalf(gcbs)

Example code includes:

# Load the psych package
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:lavaan':
## 
##     cor2cov
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
gcbs <- readRDS("./RInputFiles/GCBS_data.rds")
glimpse(gcbs)
## Observations: 2,495
## Variables: 15
## $ Q1  <int> 5, 5, 2, 5, 5, 1, 4, 5, 1, 1, 4, 5, 5, 5, 5, 4, 4, 2, 4, 5...
## $ Q2  <int> 5, 5, 4, 4, 4, 1, 3, 4, 1, 2, 4, 5, 4, 4, 4, 4, 4, 1, 2, 2...
## $ Q3  <int> 3, 5, 1, 1, 1, 1, 3, 3, 1, 1, 4, 1, 2, 4, 5, 2, 1, 1, 1, 1...
## $ Q4  <int> 5, 5, 2, 2, 4, 1, 3, 3, 1, 1, 5, 5, 4, 5, 5, 5, 3, 1, 1, 3...
## $ Q5  <int> 5, 5, 2, 4, 4, 1, 4, 4, 1, 1, 5, 5, 5, 4, 5, 5, 4, 1, 1, 4...
## $ Q6  <int> 5, 3, 2, 5, 5, 1, 3, 5, 1, 5, 5, 5, 5, 5, 5, 4, 2, 1, 1, 3...
## $ Q7  <int> 5, 5, 4, 4, 4, 1, 3, 5, 1, 1, 4, 3, 3, 5, 5, 5, 4, 1, 1, 2...
## $ Q8  <int> 3, 5, 2, 1, 3, 1, 4, 5, 1, 1, 4, 1, 3, 5, 5, 1, 1, 1, 1, 3...
## $ Q9  <int> 4, 1, 2, 4, 1, 1, 2, 5, 1, 1, 2, 1, 5, 5, 5, 3, 1, 1, 1, 4...
## $ Q10 <int> 5, 4, 4, 5, 5, 1, 3, 5, 1, 4, 5, 5, 3, 5, 5, 5, 4, 2, 3, 4...
## $ Q11 <int> 5, 4, 2, 5, 5, 1, 3, 5, 1, 1, 4, 5, 4, 5, 4, 4, 4, 2, 2, 4...
## $ Q12 <int> 5, 5, 4, 5, 5, 1, 2, 5, 1, 1, 2, 5, 3, 5, 3, 5, 1, 1, 2, 2...
## $ Q13 <int> 3, 4, 0, 1, 3, 1, 2, 3, 1, 1, 2, 1, 3, 4, 5, 4, 1, 1, 1, 4...
## $ Q14 <int> 5, 4, 2, 4, 5, 1, 3, 4, 1, 1, 1, 5, 3, 4, 5, 5, 4, 1, 2, 4...
## $ Q15 <int> 5, 5, 4, 5, 5, 1, 4, 5, 1, 5, 5, 5, 5, 5, 5, 5, 4, 2, 3, 4...
# Conduct a single-factor EFA
EFA_model <- fa(gcbs)

# View the results
EFA_model
## Factor Analysis using method =  minres
## Call: fa(r = gcbs)
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   h2   u2 com
## Q1  0.70 0.49 0.51   1
## Q2  0.72 0.52 0.48   1
## Q3  0.64 0.41 0.59   1
## Q4  0.77 0.59 0.41   1
## Q5  0.67 0.45 0.55   1
## Q6  0.75 0.56 0.44   1
## Q7  0.73 0.54 0.46   1
## Q8  0.65 0.43 0.57   1
## Q9  0.70 0.48 0.52   1
## Q10 0.56 0.32 0.68   1
## Q11 0.72 0.52 0.48   1
## Q12 0.79 0.62 0.38   1
## Q13 0.68 0.46 0.54   1
## Q14 0.74 0.55 0.45   1
## Q15 0.57 0.33 0.67   1
## 
##                 MR1
## SS loadings    7.27
## Proportion Var 0.48
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  105  and the objective function was  9.31 with Chi Square of  23173.8
## The degrees of freedom for the model are 90  and the objective function was  1.93 
## 
## The root mean square of the residuals (RMSR) is  0.08 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  2495 with the empirical chi square  3398.99  with prob <  0 
## The total number of observations was  2495  with Likelihood Chi Square =  4809.34  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.761
## RMSEA index =  0.145  and the 90 % confidence intervals are  0.142 0.149
## BIC =  4105.36
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    MR1
## Correlation of (regression) scores with factors   0.97
## Multiple R square of scores with factors          0.94
## Minimum correlation of possible factor scores     0.87
# Set up the single-factor EFA
EFA_model <- fa(gcbs)

# View the factor loadings
EFA_model$loadings
## 
## Loadings:
##     MR1  
## Q1  0.703
## Q2  0.719
## Q3  0.638
## Q4  0.770
## Q5  0.672
## Q6  0.746
## Q7  0.734
## Q8  0.654
## Q9  0.695
## Q10 0.565
## Q11 0.719
## Q12 0.786
## Q13 0.679
## Q14 0.743
## Q15 0.574
## 
##                  MR1
## SS loadings    7.267
## Proportion Var 0.484
# Create a path diagram of the items' factor loadings
fa.diagram(EFA_model)

# Take a look at the first few lines of the response data and their corresponding sum scores
head(gcbs)
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15
## 1  5  5  3  5  5  5  5  3  4   5   5   5   3   5   5
## 2  5  5  5  5  5  3  5  5  1   4   4   5   4   4   5
## 3  2  4  1  2  2  2  4  2  2   4   2   4   0   2   4
## 4  5  4  1  2  4  5  4  1  4   5   5   5   1   4   5
## 5  5  4  1  4  4  5  4  3  1   5   5   5   3   5   5
## 6  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1
rowSums(gcbs[1:6, ])
##  1  2  3  4  5  6 
## 68 65 37 55 59 15
# Then look at the first few lines of individuals' factor scores
head(EFA_model$scores)
##             MR1
## [1,]  1.5614675
## [2,]  1.3432026
## [3,] -0.3960355
## [4,]  0.7478868
## [5,]  1.0435203
## [6,] -1.7290812
# To get a feel for how the factor scores are distributed, look at their summary statistics and density plot.
summary(EFA_model$scores)
##       MR1           
##  Min.   :-1.854703  
##  1st Qu.:-0.783260  
##  Median :-0.001971  
##  Mean   : 0.000000  
##  3rd Qu.: 0.728568  
##  Max.   : 1.949580
plot(density(EFA_model$scores, na.rm = TRUE), main = "Factor Scores")

# Basic descriptive statistics
describe(gcbs)
##     vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## Q1     1 2495 3.47 1.46      4    3.59 1.48   0   5     5 -0.55    -1.10
## Q2     2 2495 2.96 1.49      3    2.96 1.48   0   5     5 -0.01    -1.40
## Q3     3 2495 2.05 1.39      1    1.82 0.00   0   5     5  0.98    -0.44
## Q4     4 2495 2.64 1.45      2    2.55 1.48   0   5     5  0.26    -1.34
## Q5     5 2495 3.25 1.47      4    3.32 1.48   0   5     5 -0.35    -1.27
## Q6     6 2495 3.11 1.51      3    3.14 1.48   0   5     5 -0.17    -1.42
## Q7     7 2495 2.67 1.51      2    2.59 1.48   0   5     5  0.28    -1.39
## Q8     8 2495 2.45 1.57      2    2.32 1.48   0   5     5  0.51    -1.30
## Q9     9 2495 2.23 1.42      2    2.05 1.48   0   5     5  0.76    -0.82
## Q10   10 2495 3.50 1.39      4    3.63 1.48   1   5     4 -0.59    -0.94
## Q11   11 2495 3.27 1.40      4    3.34 1.48   0   5     5 -0.35    -1.11
## Q12   12 2495 2.64 1.50      2    2.56 1.48   0   5     5  0.29    -1.37
## Q13   13 2495 2.10 1.38      1    1.89 0.00   0   5     5  0.89    -0.56
## Q14   14 2495 2.96 1.49      3    2.95 1.48   0   5     5 -0.02    -1.43
## Q15   15 2495 4.23 1.10      5    4.47 0.00   0   5     5 -1.56     1.71
##       se
## Q1  0.03
## Q2  0.03
## Q3  0.03
## Q4  0.03
## Q5  0.03
## Q6  0.03
## Q7  0.03
## Q8  0.03
## Q9  0.03
## Q10 0.03
## Q11 0.03
## Q12 0.03
## Q13 0.03
## Q14 0.03
## Q15 0.02
# Graphical representation of error
error.dots(gcbs)

# Graphical representation of error
error.bars(gcbs)

# Establish two sets of indices to split the dataset
N <- nrow(gcbs)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N)))
indices_CFA <- indices[!(indices %in% indices_EFA)]

# Use those indices to split the dataset into halves for your EFA and CFA
gcbs_EFA <- gcbs[indices_EFA, ]
gcbs_CFA <- gcbs[indices_CFA, ]


# Use the indices from the previous exercise to create a grouping variable
group_var <- vector("numeric", nrow(gcbs))
group_var[indices_EFA] <- 1
group_var[indices_CFA] <- 2

# Bind that grouping variable onto the gcbs dataset
gcbs_grouped <- cbind(gcbs, group_var)

# Compare stats across groups
describeBy(gcbs_grouped, group = group_var)
## 
##  Descriptive statistics by group 
## group: 1
##           vars    n mean   sd median trimmed  mad min max range  skew
## Q1           1 1247 3.48 1.44      4    3.59 1.48   0   5     5 -0.54
## Q2           2 1247 2.99 1.48      3    2.99 1.48   0   5     5 -0.02
## Q3           3 1247 2.07 1.38      1    1.84 0.00   0   5     5  0.96
## Q4           4 1247 2.62 1.44      2    2.53 1.48   0   5     5  0.26
## Q5           5 1247 3.23 1.47      4    3.30 1.48   0   5     5 -0.33
## Q6           6 1247 3.13 1.50      3    3.16 1.48   0   5     5 -0.22
## Q7           7 1247 2.66 1.51      2    2.58 1.48   0   5     5  0.29
## Q8           8 1247 2.49 1.57      2    2.37 1.48   0   5     5  0.46
## Q9           9 1247 2.21 1.39      2    2.01 1.48   0   5     5  0.79
## Q10         10 1247 3.51 1.40      4    3.63 1.48   1   5     4 -0.59
## Q11         11 1247 3.30 1.38      4    3.38 1.48   0   5     5 -0.39
## Q12         12 1247 2.63 1.51      2    2.54 1.48   0   5     5  0.31
## Q13         13 1247 2.14 1.40      1    1.94 0.00   0   5     5  0.82
## Q14         14 1247 2.95 1.49      3    2.94 1.48   0   5     5 -0.01
## Q15         15 1247 4.24 1.11      5    4.48 0.00   1   5     4 -1.57
## group_var   16 1247 1.00 0.00      1    1.00 0.00   1   1     0   NaN
##           kurtosis   se
## Q1           -1.08 0.04
## Q2           -1.37 0.04
## Q3           -0.46 0.04
## Q4           -1.33 0.04
## Q5           -1.28 0.04
## Q6           -1.40 0.04
## Q7           -1.38 0.04
## Q8           -1.34 0.04
## Q9           -0.76 0.04
## Q10          -0.96 0.04
## Q11          -1.05 0.04
## Q12          -1.36 0.04
## Q13          -0.69 0.04
## Q14          -1.41 0.04
## Q15           1.67 0.03
## group_var      NaN 0.00
## -------------------------------------------------------- 
## group: 2
##           vars    n mean   sd median trimmed  mad min max range  skew
## Q1           1 1248 3.47 1.47      4    3.59 1.48   0   5     5 -0.55
## Q2           2 1248 2.94 1.51      3    2.93 1.48   0   5     5  0.01
## Q3           3 1248 2.03 1.39      1    1.80 0.00   0   5     5  1.00
## Q4           4 1248 2.66 1.46      2    2.57 1.48   0   5     5  0.26
## Q5           5 1248 3.28 1.47      4    3.35 1.48   0   5     5 -0.37
## Q6           6 1248 3.09 1.51      3    3.11 1.48   0   5     5 -0.11
## Q7           7 1248 2.67 1.51      2    2.59 1.48   0   5     5  0.28
## Q8           8 1248 2.41 1.57      2    2.27 1.48   0   5     5  0.57
## Q9           9 1248 2.26 1.45      2    2.08 1.48   0   5     5  0.73
## Q10         10 1248 3.50 1.38      4    3.62 1.48   1   5     4 -0.58
## Q11         11 1248 3.23 1.42      3    3.29 1.48   0   5     5 -0.31
## Q12         12 1248 2.66 1.50      2    2.58 1.48   0   5     5  0.26
## Q13         13 1248 2.06 1.37      1    1.85 0.00   0   5     5  0.95
## Q14         14 1248 2.96 1.49      3    2.95 1.48   0   5     5 -0.03
## Q15         15 1248 4.21 1.10      5    4.45 0.00   0   5     5 -1.56
## group_var   16 1248 2.00 0.00      2    2.00 0.00   2   2     0   NaN
##           kurtosis   se
## Q1           -1.13 0.04
## Q2           -1.43 0.04
## Q3           -0.43 0.04
## Q4           -1.35 0.04
## Q5           -1.26 0.04
## Q6           -1.44 0.04
## Q7           -1.40 0.04
## Q8           -1.25 0.04
## Q9           -0.89 0.04
## Q10          -0.93 0.04
## Q11          -1.17 0.04
## Q12          -1.37 0.04
## Q13          -0.42 0.04
## Q14          -1.45 0.04
## Q15           1.74 0.03
## group_var      NaN 0.00
statsBy(gcbs_grouped, group = "group_var")
## Statistics within and between groups  
## Call: statsBy(data = gcbs_grouped, group = "group_var")
## Intraclass Correlation 1 (Percentage of variance due to groups) 
##        Q1        Q2        Q3        Q4        Q5        Q6        Q7 
##         0         0         0         0         0         0         0 
##        Q8        Q9       Q10       Q11       Q12       Q13       Q14 
##         0         0         0         0         0         0         0 
##       Q15 group_var 
##         0         1 
## Intraclass Correlation 2 (Reliability of group differences) 
##        Q1        Q2        Q3        Q4        Q5        Q6        Q7 
##    -93.70     -0.60     -1.00     -1.07     -0.87     -1.35   -104.83 
##        Q8        Q9       Q10       Q11       Q12       Q13       Q14 
##      0.48     -0.36    -84.52      0.40     -2.47      0.49   -337.47 
##       Q15 group_var 
##     -1.93      1.00 
## eta^2 between groups  
##  Q1.bg  Q2.bg  Q3.bg  Q4.bg  Q5.bg  Q6.bg  Q7.bg  Q8.bg  Q9.bg Q10.bg 
##      0      0      0      0      0      0      0      0      0      0 
## Q11.bg Q12.bg Q13.bg Q14.bg Q15.bg 
##      0      0      0      0      0 
## 
## To see the correlations between and within groups, use the short=FALSE option in your print statement.
## Many results are not shown directly. To see specific objects select from the following list:
##  mean sd n F ICC1 ICC2 ci1 ci2 raw rbg pbg rwg nw pwg etabg etawg nwg nG Call
# Take a look at some correlation data
lowerCor(gcbs, use = "pairwise.complete.obs")
##     Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10 
## Q1  1.00                                             
## Q2  0.53 1.00                                        
## Q3  0.36 0.40 1.00                                   
## Q4  0.52 0.53 0.50 1.00                              
## Q5  0.48 0.46 0.40 0.57 1.00                         
## Q6  0.63 0.55 0.40 0.61 0.50 1.00                    
## Q7  0.47 0.67 0.42 0.57 0.45 0.54 1.00               
## Q8  0.39 0.38 0.78 0.49 0.41 0.41 0.41 1.00          
## Q9  0.42 0.49 0.49 0.56 0.46 0.48 0.53 0.48 1.00     
## Q10 0.44 0.38 0.32 0.40 0.43 0.41 0.39 0.36 0.37 1.00
## Q11 0.64 0.52 0.34 0.52 0.49 0.62 0.49 0.37 0.46 0.45
## Q12 0.52 0.72 0.44 0.60 0.49 0.59 0.75 0.42 0.57 0.40
## Q13 0.38 0.40 0.71 0.51 0.43 0.42 0.45 0.76 0.54 0.37
## Q14 0.53 0.50 0.43 0.60 0.54 0.55 0.52 0.45 0.55 0.41
## Q15 0.51 0.40 0.27 0.39 0.45 0.47 0.39 0.31 0.32 0.45
##     Q11  Q12  Q13  Q14  Q15 
## Q11 1.00                    
## Q12 0.55 1.00               
## Q13 0.40 0.49 1.00          
## Q14 0.56 0.56 0.50 1.00     
## Q15 0.54 0.41 0.30 0.46 1.00
# Take a look at some correlation data
corr.test(gcbs, use = "pairwise.complete.obs")$p
##                Q1            Q2            Q3            Q4            Q5
## Q1   0.000000e+00 1.038105e-175  2.525793e-74 1.746323e-174 5.801103e-143
## Q2  1.384140e-177  0.000000e+00  6.236650e-93 8.388030e-183 1.237758e-127
## Q3   3.608276e-75  3.282447e-94  0.000000e+00 1.087371e-155  1.282718e-94
## Q4  2.359896e-176 1.075388e-184 1.647531e-157  0.000000e+00 7.309882e-214
## Q5  1.054746e-144 2.578663e-129  6.108182e-96 8.032837e-216  0.000000e+00
## Q6  1.477903e-277 1.757495e-198  1.248831e-96 3.037754e-253 1.470071e-154
## Q7  8.142449e-139  0.000000e+00 5.318107e-109 5.260693e-212 2.336455e-122
## Q8   1.549786e-91  5.579984e-85  0.000000e+00 1.898345e-150 1.224898e-101
## Q9  4.344797e-106 3.376084e-148 3.104324e-151 3.108940e-210 1.462606e-133
## Q10 3.550942e-116  9.131376e-87  6.760094e-61  1.298818e-98 1.476449e-111
## Q11 9.499179e-292 1.822105e-173  2.580298e-69 9.037626e-175 2.540009e-153
## Q12 9.097129e-175  0.000000e+00 1.005698e-117 1.550592e-244 2.612164e-152
## Q13  9.052542e-87  1.297778e-96  0.000000e+00 1.798859e-167 1.168044e-111
## Q14 1.912163e-184 3.969981e-158 1.103607e-113 8.219433e-248 4.008639e-188
## Q15 9.793389e-162  3.129498e-96  1.676489e-41  1.255385e-93 3.849304e-125
##                Q6            Q7            Q8            Q9           Q10
## Q1  1.448345e-275 4.315498e-137  2.634636e-90 1.433783e-104 1.420377e-114
## Q2  1.493871e-196  0.000000e+00  6.695981e-84 1.958129e-146  1.267356e-85
## Q3   2.997195e-95 1.861338e-107  0.000000e+00 1.862595e-149  2.704037e-60
## Q4  2.916244e-251 4.734624e-210 1.120024e-148 2.766957e-208  3.376926e-97
## Q5  9.408454e-153 9.813111e-121 3.429715e-100 7.459290e-132 5.462862e-110
## Q6   0.000000e+00 2.582592e-187 1.111480e-101 4.903141e-145  2.966233e-99
## Q7  3.149502e-189  0.000000e+00 1.110564e-101 4.177923e-178  1.437786e-89
## Q8  3.704932e-103 3.582463e-103  0.000000e+00 3.981155e-142  1.778530e-78
## Q9  8.755609e-147 5.497267e-180 7.372509e-144  0.000000e+00  2.784487e-79
## Q10 1.098605e-100  9.585238e-91  2.223163e-79  3.093874e-80  0.000000e+00
## Q11 4.981130e-268 8.601767e-153  2.643708e-82 1.655959e-131 4.633595e-125
## Q12 2.403960e-231  0.000000e+00 5.764588e-109 7.071640e-216  1.561964e-98
## Q13 1.353745e-109 7.858024e-124  0.000000e+00 2.581509e-186  4.747674e-81
## Q14 7.044729e-195 6.545724e-170 2.485870e-124 2.858794e-201 3.168647e-104
## Q15 1.340087e-134  7.843219e-91  4.328489e-55  4.516727e-61 2.157366e-124
##               Q11           Q12           Q13           Q14           Q15
## Q1  9.404187e-290 6.597467e-173  1.267356e-85 1.472366e-182 6.659505e-160
## Q2  1.293694e-171  0.000000e+00  2.997195e-95 2.659887e-156  6.884895e-95
## Q3   1.548179e-68 4.123361e-116  0.000000e+00 4.304066e-112  1.676489e-41
## Q4  6.597467e-173 1.457557e-242 1.241212e-165 7.808462e-246  2.259693e-92
## Q5  1.600206e-151 1.593420e-150 4.438568e-110 3.206912e-186 1.809173e-123
## Q6  4.831696e-266 2.235683e-229 4.873482e-108 5.847125e-193 6.968453e-133
## Q7  5.333096e-151  0.000000e+00 3.378951e-122 4.582007e-168  1.254915e-89
## Q8   2.908079e-81 1.959960e-107  0.000000e+00 1.093783e-122  1.298547e-54
## Q9  8.114201e-130 6.505909e-214 2.039392e-184 2.458563e-199  2.258363e-60
## Q10 2.131454e-123  3.904910e-97  4.747674e-80 1.013967e-102 9.708145e-123
## Q11  0.000000e+00 1.571210e-193  1.584049e-94 1.610933e-205 1.319712e-186
## Q12 1.870488e-195  0.000000e+00 9.395071e-146 1.350081e-206 3.398356e-100
## Q13  7.920246e-96 1.648258e-147  0.000000e+00 6.585550e-154  2.751475e-52
## Q14 1.851647e-207 1.534183e-208 1.013162e-155  0.000000e+00 2.883420e-130
## Q15 1.629274e-188 1.171847e-101  1.375737e-52 5.766839e-132  0.000000e+00
# Take a look at some correlation data
corr.test(gcbs, use = "pairwise.complete.obs")$ci
##             lower         r     upper             p
## Q1-Q2   0.4970162 0.5259992 0.5538098 1.384140e-177
## Q1-Q3   0.3206223 0.3553928 0.3892067  3.608276e-75
## Q1-Q4   0.4953852 0.5244323 0.5523079 2.359896e-176
## Q1-Q5   0.4503342 0.4810747 0.5106759 1.054746e-144
## Q1-Q6   0.6071117 0.6313131 0.6543444 1.477903e-277
## Q1-Q7   0.4412058 0.4722710 0.5022057 8.142449e-139
## Q1-Q8   0.3564216 0.3902059 0.4229712  1.549786e-91
## Q1-Q9   0.3850453 0.4179718 0.4498355 4.344797e-106
## Q1-Q10  0.4034438 0.4357865 0.4670415 3.550942e-116
## Q1-Q11  0.6199265 0.6435136 0.6659388 9.499179e-292
## Q1-Q12  0.4932727 0.5224025 0.5503620 9.097129e-175
## Q1-Q13  0.3464313 0.3805006 0.4135673  9.052542e-87
## Q1-Q14  0.5059498 0.5345780 0.5620298 1.912163e-184
## Q1-Q15  0.4753633 0.5051815 0.5338405 9.793389e-162
## Q2-Q3   0.3618855 0.3955108 0.4281083  3.282447e-94
## Q2-Q4   0.5062706 0.5348860 0.5623248 1.075388e-184
## Q2-Q5   0.4259018 0.4574975 0.4879788 2.578663e-129
## Q2-Q6   0.5234810 0.5513960 0.5781285 1.757495e-198
## Q2-Q7   0.6501266 0.6722188 0.6931753  0.000000e+00
## Q2-Q8   0.3425926 0.3767693 0.4099501  5.579984e-85
## Q2-Q9   0.4556319 0.4861810 0.5155863 3.376084e-148
## Q2-Q10  0.3464233 0.3804928 0.4135598  9.131376e-87
## Q2-Q11  0.4915283 0.5207263 0.5487548 1.822105e-173
## Q2-Q12  0.6962013 0.7158851 0.7344931  0.000000e+00
## Q2-Q13  0.3667134 0.4001964 0.4326439  1.297778e-96
## Q2-Q14  0.4702225 0.5002339 0.5290898 3.969981e-158
## Q2-Q15  0.3659505 0.3994560 0.4319274  3.129498e-96
## Q3-Q4   0.4693335 0.4993781 0.5282679 1.647531e-157
## Q3-Q5   0.3653695 0.3988923 0.4313817  6.108182e-96
## Q3-Q6   0.3667467 0.4002287 0.4326752  1.248831e-96
## Q3-Q7   0.3904688 0.4232258 0.4549125 5.318107e-109
## Q3-Q8   0.7683496 0.7839542 0.7986273  0.000000e+00
## Q3-Q9   0.4601647 0.4905484 0.5197845 3.104324e-151
## Q3-Q10  0.2853436 0.3209913 0.3557521  6.760094e-61
## Q3-Q11  0.3066780 0.3418064 0.3760050  2.580298e-69
## Q3-Q12  0.4061739 0.4384278 0.4695906 1.005698e-117
## Q3-Q13  0.6919756 0.7118867 0.7307155  0.000000e+00
## Q3-Q14  0.3989973 0.4314834 0.4628876 1.103607e-113
## Q3-Q15  0.2285790 0.2654400 0.3015410  1.676489e-41
## Q4-Q5   0.5438704 0.5709273 0.5967985 8.032837e-216
## Q4-Q6   0.5837641 0.6090539 0.6331630 3.037754e-253
## Q4-Q7   0.5394959 0.5667395 0.5927977 5.260693e-212
## Q4-Q8   0.4589969 0.4894234 0.5187032 1.898345e-150
## Q4-Q9   0.5374441 0.5647747 0.5909202 3.108940e-210
## Q4-Q10  0.3706739 0.4040387 0.4363621  1.298818e-98
## Q4-Q11  0.4932765 0.5224062 0.5503655 9.037626e-175
## Q4-Q12  0.5749365 0.6006273 0.6251350 1.550592e-244
## Q4-Q13  0.4833700 0.5128834 0.5412322 1.798859e-167
## Q4-Q14  0.5782876 0.6038268 0.6281838 8.219433e-248
## Q4-Q15  0.3607035 0.3943633 0.4269973  1.255385e-93
## Q5-Q6   0.4650551 0.4952588 0.5243108 1.470071e-154
## Q5-Q7   0.4142088 0.4461981 0.4770864 2.336455e-122
## Q5-Q8   0.3765708 0.4097576 0.4418941 1.224898e-101
## Q5-Q9   0.4328328 0.4641904 0.4944261 1.462606e-133
## Q5-Q10  0.3951535 0.4277623 0.4592945 1.476449e-111
## Q5-Q11  0.4632435 0.4935141 0.5226345 2.540009e-153
## Q5-Q12  0.4617541 0.4920795 0.5212560 2.612164e-152
## Q5-Q13  0.3953385 0.4279414 0.4594675 1.168044e-111
## Q5-Q14  0.5106389 0.5390785 0.5663399 4.008639e-188
## Q5-Q15  0.4189383 0.4507697 0.4814945 3.849304e-125
## Q6-Q7   0.5120337 0.5404170 0.5676214 3.149502e-189
## Q6-Q8   0.3794902 0.4125879 0.4446310 3.704932e-103
## Q6-Q9   0.4534992 0.4841255 0.5136099 8.755609e-147
## Q6-Q10  0.3747259 0.4079687 0.4401639 1.098605e-100
## Q6-Q11  0.5981808 0.6228032 0.6462508 4.981130e-268
## Q6-Q12  0.5610551 0.5873651 0.6124896 2.403960e-231
## Q6-Q13  0.3915639 0.4242864 0.4559371 1.353745e-109
## Q6-Q14  0.5190730 0.5471694 0.5740847 7.044729e-195
## Q6-Q15  0.4345044 0.4658040 0.4959800 1.340087e-134
## Q7-Q8   0.3795181 0.4126150 0.4446571 3.582463e-103
## Q7-Q9   0.5001718 0.5290301 0.5567146 5.497267e-180
## Q7-Q10  0.3547857 0.3886172 0.4214323  9.585238e-91
## Q7-Q11  0.4624648 0.4927641 0.5219138 8.601767e-153
## Q7-Q12  0.7365820 0.7540288 0.7704729  0.000000e+00
## Q7-Q13  0.4167211 0.4486267 0.4794284 7.858024e-124
## Q7-Q14  0.4867147 0.5160994 0.5443174 6.545724e-170
## Q7-Q15  0.3549662 0.3887925 0.4216021  7.843219e-91
## Q8-Q9   0.4490408 0.4798277 0.5094765 7.372509e-144
## Q8-Q10  0.3302521 0.3647668 0.3983073  2.223163e-79
## Q8-Q11  0.3367608 0.3710987 0.4044508  2.643708e-82
## Q8-Q12  0.3904041 0.4231631 0.4548520 5.764588e-109
## Q8-Q13  0.7398147 0.7570774 0.7733440  0.000000e+00
## Q8-Q14  0.4175690 0.4494462 0.4802185 2.485870e-124
## Q8-Q15  0.2696028 0.3056115 0.3407668  4.328489e-55
## Q9-Q10  0.3321729 0.3666358 0.4001210  3.093874e-80
## Q9-Q11  0.4294991 0.4609717 0.4913259 1.655959e-131
## Q9-Q12  0.5439334 0.5709876 0.5968561 7.071640e-216
## Q9-Q13  0.5083417 0.5368740 0.5642288 2.581509e-186
## Q9-Q14  0.5268510 0.5546263 0.5812182 2.858794e-201
## Q9-Q15  0.2858045 0.3214413 0.3561903  4.516727e-61
## Q10-Q11 0.4188025 0.4506384 0.4813679 4.633595e-125
## Q10-Q12 0.3705162 0.4038857 0.4362140  1.561964e-98
## Q10-Q13 0.3339871 0.3684007 0.4018335  4.747674e-81
## Q10-Q14 0.3815258 0.4145611 0.4465387 3.168647e-104
## Q10-Q15 0.4176732 0.4495470 0.4803157 2.157366e-124
## Q11-Q12 0.5197817 0.5478491 0.5747350 1.870488e-195
## Q11-Q13 0.3651436 0.3986730 0.4311694  7.920246e-96
## Q11-Q14 0.5342028 0.5616704 0.5879532 1.851647e-207
## Q11-Q15 0.5111332 0.5395529 0.5667941 1.629274e-188
## Q12-Q13 0.4545949 0.4851817 0.5146255 1.648258e-147
## Q12-Q14 0.5354702 0.5628844 0.5891136 1.534183e-208
## Q12-Q15 0.3766079 0.4097936 0.4419289 1.171847e-101
## Q13-Q14 0.4667464 0.4968874 0.5258754 1.013162e-155
## Q13-Q15 0.2625236 0.2986885 0.3340155  1.375737e-52
## Q14-Q15 0.4302457 0.4616926 0.4920203 5.766839e-132
# Estimate coefficient alpha
alpha(gcbs)
## 
## Reliability analysis   
## Call: alpha(x = gcbs)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean sd median_r
##       0.93      0.93    0.94      0.48  14 0.002  2.9  1     0.47
## 
##  lower alpha upper     95% confidence boundaries
## 0.93 0.93 0.94 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## Q1       0.93      0.93    0.94      0.48  13   0.0021 0.0105  0.46
## Q2       0.93      0.93    0.94      0.48  13   0.0021 0.0099  0.47
## Q3       0.93      0.93    0.94      0.49  13   0.0020 0.0084  0.48
## Q4       0.93      0.93    0.94      0.47  13   0.0022 0.0105  0.46
## Q5       0.93      0.93    0.94      0.48  13   0.0021 0.0112  0.48
## Q6       0.93      0.93    0.94      0.48  13   0.0021 0.0104  0.46
## Q7       0.93      0.93    0.94      0.48  13   0.0021 0.0098  0.47
## Q8       0.93      0.93    0.94      0.48  13   0.0020 0.0086  0.49
## Q9       0.93      0.93    0.94      0.48  13   0.0021 0.0108  0.46
## Q10      0.93      0.93    0.94      0.49  14   0.0020 0.0102  0.49
## Q11      0.93      0.93    0.94      0.48  13   0.0021 0.0104  0.46
## Q12      0.93      0.93    0.94      0.47  13   0.0022 0.0093  0.46
## Q13      0.93      0.93    0.94      0.48  13   0.0021 0.0092  0.48
## Q14      0.93      0.93    0.94      0.48  13   0.0021 0.0109  0.46
## Q15      0.93      0.93    0.94      0.49  14   0.0020 0.0095  0.49
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## Q1  2495  0.73  0.73  0.70   0.68  3.5 1.5
## Q2  2495  0.74  0.74  0.72   0.69  3.0 1.5
## Q3  2495  0.68  0.67  0.66   0.62  2.0 1.4
## Q4  2495  0.78  0.78  0.76   0.74  2.6 1.5
## Q5  2495  0.70  0.70  0.67   0.65  3.3 1.5
## Q6  2495  0.76  0.76  0.74   0.72  3.1 1.5
## Q7  2495  0.75  0.75  0.73   0.70  2.7 1.5
## Q8  2495  0.69  0.69  0.68   0.63  2.5 1.6
## Q9  2495  0.72  0.72  0.69   0.67  2.2 1.4
## Q10 2495  0.61  0.61  0.57   0.55  3.5 1.4
## Q11 2495  0.74  0.74  0.72   0.69  3.3 1.4
## Q12 2495  0.79  0.79  0.79   0.75  2.6 1.5
## Q13 2495  0.71  0.71  0.70   0.66  2.1 1.4
## Q14 2495  0.76  0.76  0.74   0.71  3.0 1.5
## Q15 2495  0.60  0.62  0.58   0.56  4.2 1.1
## 
## Non missing response frequency for each item
##        0    1    2    3    4    5 miss
## Q1  0.00 0.16 0.12 0.12 0.27 0.32    0
## Q2  0.01 0.23 0.19 0.16 0.20 0.22    0
## Q3  0.00 0.55 0.13 0.12 0.10 0.10    0
## Q4  0.00 0.32 0.18 0.15 0.20 0.14    0
## Q5  0.00 0.19 0.14 0.13 0.28 0.26    0
## Q6  0.00 0.23 0.15 0.15 0.23 0.24    0
## Q7  0.00 0.33 0.19 0.13 0.18 0.17    0
## Q8  0.00 0.44 0.12 0.14 0.12 0.18    0
## Q9  0.00 0.45 0.19 0.12 0.12 0.11    0
## Q10 0.00 0.14 0.12 0.14 0.30 0.30    0
## Q11 0.00 0.16 0.14 0.19 0.27 0.24    0
## Q12 0.00 0.34 0.18 0.15 0.17 0.17    0
## Q13 0.01 0.51 0.15 0.15 0.10 0.09    0
## Q14 0.00 0.25 0.17 0.15 0.22 0.20    0
## Q15 0.00 0.05 0.05 0.08 0.27 0.55    0
# Calculate split-half reliability
splitHalf(gcbs)
## Split half reliabilities  
## Call: splitHalf(r = gcbs)
## 
## Maximum split half reliability (lambda 4) =  0.95
## Guttman lambda 6                          =  0.94
## Average split half reliability            =  0.93
## Guttman lambda 3 (alpha)                  =  0.93
## Minimum split half reliability  (beta)    =  0.86
## Average interitem r =  0.48  with median =  0.47

Chapter 2 - Multidimensional EFA

Determining dimensionality:

  • Can use factor analysis to find the “true” number of dimensions being reflected in the data
  • Can use the bfi dataset (“Big Five” personality trait dataset)
    • Six-point scale, with 1 being very inaccurate and 6 being very accurate
  • Suppose that you do not have a theory underlying the data and instead want to use an empirical approach with eigenvalues
    • bfi_EFA_cor <- cor(bfi_EFA, use = "pairwise.complete.obs")
    • eigenvals <- eigen(bfi_EFA_cor)
    • eigenvals$values
    • scree(bfi_EFA_cor, factors = FALSE) # factors with eigenvalues greater than 1 are typically the ones to retain
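The eigenvalue rule above can be applied programmatically; a minimal sketch, assuming eigenvals has been computed as in the snippet above:

```r
# Kaiser criterion: count the eigenvalues greater than 1
# (eigen() returns them in decreasing order)
n_factors <- sum(eigenvals$values > 1)
n_factors
```

Applied to the eigenvalues printed in the example code later in this section, this gives 7; since the rule is only a heuristic, it is worth checking against the scree plot (the course settles on six factors).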

Understanding multidimensional data:

  • Theory and empirical data may lead to different outcomes - psychological constructs are an example
  • Factors are the mathematical counterpart of a theoretical construct
    • How well does the hypothesis fit with the data?
  • Can instead run exploratory analysis to try to come up with factors
    • Lack of theoretical grounding can make interpretation of the results complicated
  • Can run the multidimensional analysis using any number of factors
    • EFA_model <- fa(bfi_EFA, nfactors = 6)
    • EFA_model$loadings
    • head(EFA_model$scores)

Investigating model fit:

  • Can look at absolute fit (assessed against typical ranges and cutoff values) and relative fit (no set ranges; used mainly to compare nested models from the same dataset)
  • Goal is to have a non-significant chi-squared test, though that is rare for large datasets
  • The TLI (Tucker-Lewis Index) should be 0.90 or above
  • The RMSEA (root mean square error of approximation) should be 0.05 or below
  • The BIC is a relative-fit statistic, so it is meaningful only for comparing across models (lower BIC is better)
    • bfi_theory <- fa(bfi_EFA, nfactors = 5)
    • bfi_eigen <- fa(bfi_EFA, nfactors = 6)
    • bfi_theory$BIC
    • bfi_eigen$BIC
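These fit statistics can also be extracted directly from the fitted object rather than read off the printout; a minimal sketch, assuming a model object returned by psych::fa() such as bfi_eigen:

```r
# Absolute fit: TLI (want 0.90+) and RMSEA (want 0.05 or below)
bfi_eigen$TLI
bfi_eigen$RMSEA  # point estimate with confidence bounds

# Relative fit: BIC (lower is better; compare only across
# models fit to the same dataset)
bfi_eigen$BIC
```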

Example code includes:

data(bfi, package="psych")
glimpse(bfi)
## Observations: 2,800
## Variables: 28
## $ A1        <int> 2, 2, 5, 4, 2, 6, 2, 4, 4, 2, 4, 2, 5, 5, 4, 4, 4, 5...
## $ A2        <int> 4, 4, 4, 4, 3, 6, 5, 3, 3, 5, 4, 5, 5, 5, 5, 3, 6, 5...
## $ A3        <int> 3, 5, 5, 6, 3, 5, 5, 1, 6, 6, 5, 5, 5, 5, 2, 6, 6, 5...
## $ A4        <int> 4, 2, 4, 5, 4, 6, 3, 5, 3, 6, 6, 5, 6, 6, 2, 6, 2, 4...
## $ A5        <int> 4, 5, 4, 5, 5, 5, 5, 1, 3, 5, 5, 5, 4, 6, 1, 3, 5, 5...
## $ C1        <int> 2, 5, 4, 4, 4, 6, 5, 3, 6, 6, 4, 5, 5, 4, 5, 5, 4, 5...
## $ C2        <int> 3, 4, 5, 4, 4, 6, 4, 2, 6, 5, 3, 4, 4, 4, 5, 5, 4, 5...
## $ C3        <int> 3, 4, 4, 3, 5, 6, 4, 4, 3, 6, 5, 5, 3, 4, 5, 5, 4, 5...
## $ C4        <int> 4, 3, 2, 5, 3, 1, 2, 2, 4, 2, 3, 4, 2, 2, 2, 3, 4, 4...
## $ C5        <int> 4, 4, 5, 5, 2, 3, 3, 4, 5, 1, 2, 5, 2, 1, 2, 5, 4, 3...
## $ E1        <int> 3, 1, 2, 5, 2, 2, 4, 3, 5, 2, 1, 3, 3, 2, 3, 1, 1, 2...
## $ E2        <int> 3, 1, 4, 3, 2, 1, 3, 6, 3, 2, 3, 3, 3, 2, 4, 1, 2, 2...
## $ E3        <int> 3, 6, 4, 4, 5, 6, 4, 4, NA, 4, 2, 4, 3, 4, 3, 6, 5, ...
## $ E4        <int> 4, 4, 4, 4, 4, 5, 5, 2, 4, 5, 5, 5, 2, 6, 6, 6, 5, 6...
## $ E5        <int> 4, 3, 5, 4, 5, 6, 5, 1, 3, 5, 4, 4, 4, 5, 5, 4, 5, 6...
## $ N1        <int> 3, 3, 4, 2, 2, 3, 1, 6, 5, 5, 3, 4, 1, 1, 2, 4, 4, 6...
## $ N2        <int> 4, 3, 5, 5, 3, 5, 2, 3, 5, 5, 3, 5, 2, 1, 4, 5, 4, 5...
## $ N3        <int> 2, 3, 4, 2, 4, 2, 2, 2, 2, 5, 4, 3, 2, 1, 2, 4, 4, 5...
## $ N4        <int> 2, 5, 2, 4, 4, 2, 1, 6, 3, 2, 2, 2, 2, 2, 2, 5, 4, 4...
## $ N5        <int> 3, 5, 3, 1, 3, 3, 1, 4, 3, 4, 3, NA, 2, 1, 3, 5, 5, ...
## $ O1        <int> 3, 4, 4, 3, 3, 4, 5, 3, 6, 5, 5, 4, 4, 5, 5, 6, 5, 5...
## $ O2        <int> 6, 2, 2, 3, 3, 3, 2, 2, 6, 1, 3, 6, 2, 3, 2, 6, 1, 1...
## $ O3        <int> 3, 4, 5, 4, 4, 5, 5, 4, 6, 5, 5, 4, 4, 4, 5, 6, 5, 4...
## $ O4        <int> 4, 3, 5, 3, 3, 6, 6, 5, 6, 5, 6, 5, 5, 4, 5, 3, 6, 5...
## $ O5        <int> 3, 3, 2, 5, 3, 1, 1, 3, 1, 2, 3, 4, 2, 4, 5, 2, 3, 4...
## $ gender    <int> 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1...
## $ education <int> NA, NA, NA, NA, NA, 3, NA, 2, 1, NA, 1, NA, NA, NA, ...
## $ age       <int> 16, 18, 17, 17, 17, 21, 18, 19, 19, 17, 21, 16, 16, ...
# Establish two sets of indices to split the dataset
N <- nrow(bfi)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N)))
indices_CFA <- indices[!(indices %in% indices_EFA)]

# Use those indices to split the dataset into halves for your EFA and CFA
bfi_EFA <- bfi[indices_EFA, ]
bfi_CFA <- bfi[indices_CFA, ]


# Calculate the correlation matrix first
bfi_EFA_cor <- cor(bfi_EFA, use = "pairwise.complete.obs")

# Then use that correlation matrix to calculate eigenvalues
eigenvals <- eigen(bfi_EFA_cor)

# Look at the eigenvalues returned
eigenvals$values
##  [1] 5.0300529 2.8855687 2.1326114 1.8423851 1.5839370 1.3565596 1.1281967
##  [8] 0.8876166 0.8368945 0.7759991 0.7411295 0.7223954 0.6965653 0.6688663
## [15] 0.6495947 0.6406188 0.5569057 0.5550641 0.5408830 0.5243398 0.4913524
## [22] 0.4769243 0.4467000 0.4206217 0.4021440 0.3781987 0.3609526 0.2669220
# Then use that correlation matrix to create the scree plot
scree(bfi_EFA_cor, factors = FALSE)

# Run the EFA with six factors (as indicated by your scree plot)
EFA_model <- fa(bfi_EFA, nfactors=6)
## Loading required namespace: GPArotation
# View results from the model object
EFA_model
## Factor Analysis using method =  minres
## Call: fa(r = bfi_EFA, nfactors = 6)
## Standardized loadings (pattern matrix) based upon correlation matrix
##             MR2   MR5   MR3   MR1   MR4   MR6    h2   u2 com
## A1         0.07 -0.39  0.05 -0.03  0.02  0.51 0.401 0.60 1.9
## A2         0.04  0.60  0.09 -0.04  0.05 -0.19 0.445 0.56 1.3
## A3        -0.04  0.65  0.03 -0.06  0.05 -0.01 0.484 0.52 1.0
## A4        -0.07  0.44  0.19 -0.06 -0.12  0.02 0.267 0.73 1.6
## A5        -0.17  0.59  0.01 -0.12  0.10  0.08 0.494 0.51 1.4
## C1         0.04  0.00  0.54  0.12  0.20  0.10 0.367 0.63 1.4
## C2         0.06  0.12  0.59  0.21  0.11  0.17 0.440 0.56 1.6
## C3         0.01  0.11  0.54  0.12  0.01  0.09 0.316 0.68 1.2
## C4         0.07  0.05 -0.69  0.11  0.04  0.19 0.553 0.45 1.2
## C5         0.11  0.00 -0.57  0.19  0.10  0.01 0.430 0.57 1.4
## E1        -0.13 -0.14  0.07  0.58 -0.10  0.07 0.395 0.61 1.3
## E2         0.07 -0.09 -0.03  0.69 -0.10  0.03 0.593 0.41 1.1
## E3         0.06  0.26 -0.02 -0.30  0.41  0.13 0.471 0.53 2.9
## E4        -0.04  0.38  0.01 -0.48  0.01  0.26 0.567 0.43 2.5
## E5         0.15  0.07  0.24 -0.35  0.31  0.03 0.398 0.60 3.3
## N1         0.80 -0.10 -0.01 -0.09 -0.03  0.04 0.641 0.36 1.1
## N2         0.83 -0.10  0.02 -0.08  0.01 -0.03 0.661 0.34 1.1
## N3         0.70  0.10 -0.07  0.12  0.04  0.02 0.568 0.43 1.1
## N4         0.43  0.08 -0.15  0.38  0.05 -0.04 0.458 0.54 2.3
## N5         0.53  0.22  0.01  0.21 -0.15  0.04 0.419 0.58 1.9
## O1        -0.05 -0.01  0.04  0.00  0.58  0.08 0.358 0.64 1.1
## O2         0.13  0.19 -0.10  0.01 -0.38  0.24 0.254 0.75 2.7
## O3        -0.01  0.06  0.01 -0.07  0.67  0.02 0.507 0.49 1.0
## O4         0.11  0.17 -0.03  0.39  0.35 -0.06 0.279 0.72 2.6
## O5         0.05  0.10 -0.05 -0.01 -0.46  0.28 0.293 0.71 1.8
## gender     0.24  0.23  0.14 -0.11 -0.17 -0.17 0.154 0.85 4.8
## education  0.00 -0.08  0.04 -0.01  0.09 -0.21 0.057 0.94 1.8
## age       -0.03  0.04  0.10 -0.11 -0.01 -0.28 0.109 0.89 1.7
## 
##                        MR2  MR5  MR3  MR1  MR4  MR6
## SS loadings           2.58 2.14 2.05 2.01 1.82 0.78
## Proportion Var        0.09 0.08 0.07 0.07 0.07 0.03
## Cumulative Var        0.09 0.17 0.24 0.31 0.38 0.41
## Proportion Explained  0.23 0.19 0.18 0.18 0.16 0.07
## Cumulative Proportion 0.23 0.41 0.59 0.77 0.93 1.00
## 
##  With factor correlations of 
##       MR2   MR5   MR3   MR1   MR4   MR6
## MR2  1.00 -0.03 -0.21  0.23 -0.02  0.11
## MR5 -0.03  1.00  0.18 -0.26  0.24  0.03
## MR3 -0.21  0.18  1.00 -0.17  0.22  0.00
## MR1  0.23 -0.26 -0.17  1.00 -0.18 -0.05
## MR4 -0.02  0.24  0.22 -0.18  1.00  0.03
## MR6  0.11  0.03  0.00 -0.05  0.03  1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 6 factors are sufficient.
## 
## The degrees of freedom for the null model are  378  and the objective function was  7.67 with Chi Square of  10648.54
## The degrees of freedom for the model are 225  and the objective function was  0.6 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  1375 with the empirical chi square  624.42  with prob <  2.9e-39 
## The total number of observations was  1400  with Likelihood Chi Square =  836.09  with prob <  3.8e-71 
## 
## Tucker Lewis Index of factoring reliability =  0.9
## RMSEA index =  0.044  and the 90 % confidence intervals are  0.041 0.047
## BIC =  -793.86
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy             
##                                                    MR2  MR5  MR3  MR1  MR4
## Correlation of (regression) scores with factors   0.93 0.89 0.88 0.89 0.86
## Multiple R square of scores with factors          0.86 0.78 0.78 0.78 0.75
## Minimum correlation of possible factor scores     0.72 0.57 0.56 0.57 0.50
##                                                    MR6
## Correlation of (regression) scores with factors   0.75
## Multiple R square of scores with factors          0.56
## Minimum correlation of possible factor scores     0.12
# Run the EFA with six factors (as indicated by your scree plot)
EFA_model <- fa(bfi_EFA, nfactors=6)

# View items' factor loadings
EFA_model$loadings
## 
## Loadings:
##           MR2    MR5    MR3    MR1    MR4    MR6   
## A1               -0.391                       0.510
## A2                0.600                      -0.192
## A3                0.651                            
## A4                0.440  0.187        -0.124       
## A5        -0.166  0.588        -0.121  0.100       
## C1                       0.542  0.115  0.198       
## C2                0.119  0.591  0.209  0.112  0.170
## C3                0.105  0.539  0.124              
## C4                      -0.687  0.107         0.191
## C5         0.107        -0.573  0.193  0.102       
## E1        -0.125 -0.139         0.576              
## E2                              0.692              
## E3                0.260        -0.296  0.408  0.126
## E4                0.376        -0.479         0.255
## E5         0.151         0.241 -0.354  0.309       
## N1         0.803 -0.101                            
## N2         0.829                                   
## N3         0.695                0.124              
## N4         0.434        -0.148  0.382              
## N5         0.532  0.216         0.208 -0.152       
## O1                                     0.578       
## O2         0.132  0.189               -0.382  0.242
## O3                                     0.674       
## O4         0.108  0.171         0.386  0.353       
## O5                0.103               -0.463  0.281
## gender     0.240  0.232  0.143 -0.109 -0.173 -0.168
## education                                    -0.210
## age                      0.103 -0.108        -0.276
## 
##                  MR2   MR5   MR3   MR1   MR4   MR6
## SS loadings    2.490 1.965 1.919 1.796 1.715 0.777
## Proportion Var 0.089 0.070 0.069 0.064 0.061 0.028
## Cumulative Var 0.089 0.159 0.228 0.292 0.353 0.381
# View the first few lines of examinees' factor scores
head(EFA_model$scores)
##              MR2        MR5         MR3         MR1        MR4        MR6
## 62551         NA         NA          NA          NA         NA         NA
## 67093 -0.7724195  0.5611340 -0.98862424 -0.35665187  0.4818819 1.16616564
## 62162 -1.1613038  0.7091861  0.01602322  0.25442770 -0.3954765 0.61004704
## 61896 -0.1630297  0.3951015  0.20243968 -0.88470545  0.1207347 0.04158926
## 67438  0.1715825 -1.5105203 -0.13152183  0.09266676 -0.3904054 1.04839765
## 63328  1.3551168  0.5068694  1.21405963 -0.52823218  0.7409618 0.74609206
# Run each theorized EFA on your dataset
bfi_theory <- fa(bfi_EFA, nfactors = 5)
bfi_eigen <- fa(bfi_EFA, nfactors = 6)

# Compare the BIC values
bfi_theory$BIC
## [1] -511.141
bfi_eigen$BIC
## [1] -793.8647

Chapter 3 - Confirmatory Factor Analysis

Setting up CFA:

  • Confirmatory analysis is based on explicitly defined factor relationships, to confirm a previously developed theory
  • Can use the results from an EFA as the baseline for a CFA; for example, reversing the wording of items with a negative loading
    • EFA_syn <- structure.sem(EFA_model)
    • EFA_syn # Path goes from factor to item, while an NA for Value means that the starting parameter will be chosen randomly
  • Can also create CFA syntax from theory explicitly, for example
    • theory_syn_eq <- "
    • AGE: A1, A2, A3, A4, A5 #Agreeableness
    • CON: C1, C2, C3, C4, C5 #Conscientiousness
    • EXT: E1, E2, E3, E4, E5 #Extraversion
    • NEU: N1, N2, N3, N4, N5 #Neuroticism
    • OPE: O1, O2, O3, O4, O5 #Openness
    • "
    • theory_syn <- cfa(text = theory_syn_eq, reference.indicators = FALSE) # sets the factor variances to 1 rather than estimating them

Understanding the sem() syntax:

  • Syntax will show the Path, Parameter, and Starting Value
    • Factor variances are shown using <-> arrows, with a Starting Value of 1 since they have been fixed at 1
    • Factor covariances also use <-> arrows
    • Item-level variances are also shown using <-> arrows
  • Can run the model using the sem() function with the syntax object
    • theory_CFA <- sem(theory_syn, data = bfi_CFA)
    • summary(theory_CFA)

Investigating model fit:

  • The chi-squared test (a likelihood-ratio test) is the only fit test printed by default, though it will usually be significant for large datasets
  • Can change the global options so that additional tests are run
    • options(fit.indices = c("CFI", "GFI", "RMSEA", "BIC"))
    • summary(theory_CFA)
    • summary(theory_CFA)$BIC # lower BIC is preferred, but is only useful for nested models on the same dataset

Example code includes:

# Conduct a five-factor EFA on the EFA half of the dataset
EFA_model <- fa(bfi_EFA, nfactors = 5)

# Use the wrapper function to create syntax for use with the sem() function
EFA_syn <- structure.sem(EFA_model)


# Set up syntax specifying which items load onto each factor
theory_syn_eq <- "
AGE: A1, A2, A3, A4, A5
CON: C1, C2, C3, C4, C5
EXT: E1, E2, E3, E4, E5
NEU: N1, N2, N3, N4, N5
OPE: O1, O2, O3, O4, O5
"

library(sem)
## 
## Attaching package: 'sem'
## The following objects are masked from 'package:lavaan':
## 
##     cfa, sem
# Feed the syntax in to have variances and covariances automatically added
theory_syn <- cfa(text = theory_syn_eq, reference.indicators = FALSE)
## NOTE: adding 25 variances to the model
# Use the sem() function to run a CFA
theory_CFA <- sem(theory_syn, data = bfi_CFA)
## Warning in sem.semmod(theory_syn, data = bfi_CFA): -289 observations
## removed due to missingness
## Warning in sem.semmod(theory_syn, data = bfi_CFA): The following observed variables are in the input covariance or raw-moment matrix but do not appear in the model:
## gender, education, age
# Use the summary function to view fit information and parameter estimates
summary(theory_CFA)
## 
##  Model Chisquare =  2005.016   Df =  265 Pr(>Chisq) = 2.150816e-264
##  AIC =  2125.016
##  BIC =  146.5663
## 
##  Normalized Residuals
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6.8135 -0.5790  0.7005  0.9165  2.3369  9.1189 
## 
##  R-square for Endogenous Variables
##     A1     A2     A3     A4     A5     C1     C2     C3     C4     C5 
## 0.1177 0.4221 0.5726 0.2652 0.4584 0.3015 0.3895 0.2854 0.4763 0.3806 
##     E1     E2     E3     E4     E5     N1     N2     N3     N4     N5 
## 0.3100 0.4955 0.4090 0.5185 0.3099 0.6809 0.6209 0.5142 0.3434 0.2209 
##     O1     O2     O3     O4     O5 
## 0.3230 0.2294 0.4473 0.0615 0.2625 
## 
##  Parameter Estimates
##             Estimate    Std Error  z value    Pr(>|z|)                  
## lam[A1:AGE] -0.47310342 0.04469539 -10.585060  3.495553e-26 A1 <--- AGE 
## lam[A2:AGE]  0.74682312 0.03424190  21.810210 1.856211e-105 A2 <--- AGE 
## lam[A3:AGE]  0.97246742 0.03690975  26.347171 5.530212e-153 A3 <--- AGE 
## lam[A4:AGE]  0.73031369 0.04413591  16.546928  1.684830e-61 A4 <--- AGE 
## lam[A5:AGE]  0.84567717 0.03685897  22.943592 1.707357e-116 A5 <--- AGE 
## lam[C1:CON] -0.67102860 0.03893437 -17.234866  1.453902e-66 C1 <--- CON 
## lam[C2:CON] -0.81745260 0.04090714 -19.983130  7.722751e-89 C2 <--- CON 
## lam[C3:CON] -0.67484497 0.04040327 -16.702731  1.252010e-62 C3 <--- CON 
## lam[C4:CON]  0.94244534 0.04196174  22.459632 1.030377e-111 C4 <--- CON 
## lam[C5:CON]  1.00338850 0.05089376  19.715355  1.591951e-86 C5 <--- CON 
## lam[E1:EXT]  0.88897886 0.04856677  18.304263  7.652070e-75 E1 <--- EXT 
## lam[E2:EXT]  1.12282478 0.04596055  24.430185 8.174273e-132 E2 <--- EXT 
## lam[E3:EXT] -0.86265804 0.03985228 -21.646390 6.573210e-104 E3 <--- EXT 
## lam[E4:EXT] -1.04778146 0.04165946 -25.151108 1.374315e-139 E4 <--- EXT 
## lam[E5:EXT] -0.73878674 0.04037337 -18.298864  8.449171e-75 E5 <--- EXT 
## lam[N1:NEU]  1.28350486 0.04147064  30.949724 2.562431e-210 N1 <--- NEU 
## lam[N2:NEU]  1.20126201 0.04131090  29.078570 6.700813e-186 N2 <--- NEU 
## lam[N3:NEU]  1.15489837 0.04498842  25.671017 2.463597e-145 N3 <--- NEU 
## lam[N4:NEU]  0.92407395 0.04646996  19.885404  5.444165e-88 N4 <--- NEU 
## lam[N5:NEU]  0.76200858 0.04958272  15.368429  2.665543e-53 N5 <--- NEU 
## lam[O1:OPE]  0.62901029 0.03848638  16.343713  4.822431e-60 O1 <--- OPE 
## lam[O2:OPE] -0.75677801 0.05522087 -13.704565  9.533735e-43 O2 <--- OPE 
## lam[O3:OPE]  0.79299695 0.04156934  19.076487  3.959980e-81 O3 <--- OPE 
## lam[O4:OPE]  0.30437169 0.04406913   6.906687  4.961028e-12 O4 <--- OPE 
## lam[O5:OPE] -0.68830815 0.04680655 -14.705379  5.953631e-49 O5 <--- OPE 
## C[AGE,CON]  -0.33903475 0.03616712  -9.374116  6.975805e-21 CON <--> AGE
## C[AGE,EXT]  -0.71675158 0.02419566 -29.623150 7.522744e-193 EXT <--> AGE
## C[AGE,NEU]  -0.23767064 0.03540025  -6.713812  1.896045e-11 NEU <--> AGE
## C[AGE,OPE]   0.23285485 0.04030234   5.777700  7.572854e-09 OPE <--> AGE
## C[CON,EXT]   0.38058118 0.03501044  10.870507  1.593130e-27 EXT <--> CON
## C[CON,NEU]   0.25969556 0.03567111   7.280277  3.331352e-13 NEU <--> CON
## C[CON,OPE]  -0.28622586 0.04021418  -7.117536  1.098734e-12 OPE <--> CON
## C[EXT,NEU]   0.27605059 0.03454400   7.991276  1.335494e-15 NEU <--> EXT
## C[EXT,OPE]  -0.34443896 0.03822892  -9.009905  2.062345e-19 OPE <--> EXT
## C[NEU,OPE]  -0.09901255 0.03959699  -2.500507  1.240156e-02 OPE <--> NEU
## V[A1]        1.67705932 0.07357444  22.794048 5.252890e-115 A1 <--> A1  
## V[A2]        0.76354863 0.03963774  19.263174  1.094593e-82 A2 <--> A2  
## V[A3]        0.70576489 0.04510292  15.647875  3.435002e-55 A3 <--> A3  
## V[A4]        1.47810259 0.06883999  21.471570 2.871544e-102 A4 <--> A4  
## V[A5]        0.84511989 0.04553868  18.558285  6.990034e-77 A5 <--> A5  
## V[C1]        1.04294084 0.05130818  20.326989  7.421893e-92 C1 <--> C1  
## V[C2]        1.04727939 0.05588028  18.741486  2.271949e-78 C2 <--> C2  
## V[C3]        1.14007575 0.05541605  20.573025  4.788395e-94 C3 <--> C3  
## V[C4]        0.97670448 0.05854026  16.684321  1.704330e-62 C4 <--> C4  
## V[C5]        1.63858678 0.08658845  18.923849  7.255663e-80 C5 <--> C5  
## V[E1]        1.75863096 0.08303725  21.178820  1.497351e-99 E1 <--> E1  
## V[E2]        1.28369574 0.07018375  18.290498  9.851190e-75 E2 <--> E2  
## V[E3]        1.07543181 0.05412632  19.868927  7.560203e-88 E3 <--> E3  
## V[E4]        1.01968201 0.05734980  17.780043  1.009021e-70 E4 <--> E4  
## V[E5]        1.21546936 0.05738608  21.180563  1.442987e-99 E5 <--> E5  
## V[N1]        0.77202708 0.05416344  14.253656  4.254140e-46 N1 <--> N1  
## V[N2]        0.88100921 0.05399874  16.315365  7.674585e-60 N2 <--> N2  
## V[N3]        1.25987596 0.06631127  18.999423  1.724296e-80 N3 <--> N3  
## V[N4]        1.63271952 0.07642767  21.362938 2.956188e-101 N4 <--> N4  
## V[N5]        2.04833227 0.09152194  22.380779 6.057690e-111 N5 <--> N5  
## V[O1]        0.82913181 0.04602848  18.013452  1.527947e-72 O1 <--> O1  
## V[O2]        1.92399378 0.09536054  20.175996  1.591273e-90 O2 <--> O2  
## V[O3]        0.77711574 0.05441673  14.280824  2.881740e-46 O3 <--> O3  
## V[O4]        1.41350151 0.06190555  22.833195 2.146796e-115 O4 <--> O4  
## V[O5]        1.33126454 0.06832159  19.485268  1.464112e-84 O5 <--> O5  
## 
##  Iterations =  26
# CAUTION THAT THIS WILL SET GLOBAL OPTIONS
# Set the options to include various fit indices so they will print
origFit <- getOption("fit.indices")
options(fit.indices = c("CFI", "GFI", "RMSEA", "BIC"))

# Use the summary function to view fit information and parameter estimates
summary(theory_CFA)
## 
##  Model Chisquare =  2005.016   Df =  265 Pr(>Chisq) = 2.150816e-264
##  Goodness-of-fit index =  0.8594733
##  RMSEA index =  0.07691165   90% CI: (NA, NA)
##  Bentler CFI =  0.7863506
##  BIC =  146.5663
## 
##  Normalized Residuals
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6.8135 -0.5790  0.7005  0.9165  2.3369  9.1189 
## 
##  R-square for Endogenous Variables
##     A1     A2     A3     A4     A5     C1     C2     C3     C4     C5 
## 0.1177 0.4221 0.5726 0.2652 0.4584 0.3015 0.3895 0.2854 0.4763 0.3806 
##     E1     E2     E3     E4     E5     N1     N2     N3     N4     N5 
## 0.3100 0.4955 0.4090 0.5185 0.3099 0.6809 0.6209 0.5142 0.3434 0.2209 
##     O1     O2     O3     O4     O5 
## 0.3230 0.2294 0.4473 0.0615 0.2625 
## 
##  Parameter Estimates
##             Estimate    Std Error  z value    Pr(>|z|)                  
## lam[A1:AGE] -0.47310342 0.04469539 -10.585060  3.495553e-26 A1 <--- AGE 
## lam[A2:AGE]  0.74682312 0.03424190  21.810210 1.856211e-105 A2 <--- AGE 
## lam[A3:AGE]  0.97246742 0.03690975  26.347171 5.530212e-153 A3 <--- AGE 
## lam[A4:AGE]  0.73031369 0.04413591  16.546928  1.684830e-61 A4 <--- AGE 
## lam[A5:AGE]  0.84567717 0.03685897  22.943592 1.707357e-116 A5 <--- AGE 
## lam[C1:CON] -0.67102860 0.03893437 -17.234866  1.453902e-66 C1 <--- CON 
## lam[C2:CON] -0.81745260 0.04090714 -19.983130  7.722751e-89 C2 <--- CON 
## lam[C3:CON] -0.67484497 0.04040327 -16.702731  1.252010e-62 C3 <--- CON 
## lam[C4:CON]  0.94244534 0.04196174  22.459632 1.030377e-111 C4 <--- CON 
## lam[C5:CON]  1.00338850 0.05089376  19.715355  1.591951e-86 C5 <--- CON 
## lam[E1:EXT]  0.88897886 0.04856677  18.304263  7.652070e-75 E1 <--- EXT 
## lam[E2:EXT]  1.12282478 0.04596055  24.430185 8.174273e-132 E2 <--- EXT 
## lam[E3:EXT] -0.86265804 0.03985228 -21.646390 6.573210e-104 E3 <--- EXT 
## lam[E4:EXT] -1.04778146 0.04165946 -25.151108 1.374315e-139 E4 <--- EXT 
## lam[E5:EXT] -0.73878674 0.04037337 -18.298864  8.449171e-75 E5 <--- EXT 
## lam[N1:NEU]  1.28350486 0.04147064  30.949724 2.562431e-210 N1 <--- NEU 
## lam[N2:NEU]  1.20126201 0.04131090  29.078570 6.700813e-186 N2 <--- NEU 
## lam[N3:NEU]  1.15489837 0.04498842  25.671017 2.463597e-145 N3 <--- NEU 
## lam[N4:NEU]  0.92407395 0.04646996  19.885404  5.444165e-88 N4 <--- NEU 
## lam[N5:NEU]  0.76200858 0.04958272  15.368429  2.665543e-53 N5 <--- NEU 
## lam[O1:OPE]  0.62901029 0.03848638  16.343713  4.822431e-60 O1 <--- OPE 
## lam[O2:OPE] -0.75677801 0.05522087 -13.704565  9.533735e-43 O2 <--- OPE 
## lam[O3:OPE]  0.79299695 0.04156934  19.076487  3.959980e-81 O3 <--- OPE 
## lam[O4:OPE]  0.30437169 0.04406913   6.906687  4.961028e-12 O4 <--- OPE 
## lam[O5:OPE] -0.68830815 0.04680655 -14.705379  5.953631e-49 O5 <--- OPE 
## C[AGE,CON]  -0.33903475 0.03616712  -9.374116  6.975805e-21 CON <--> AGE
## C[AGE,EXT]  -0.71675158 0.02419566 -29.623150 7.522744e-193 EXT <--> AGE
## C[AGE,NEU]  -0.23767064 0.03540025  -6.713812  1.896045e-11 NEU <--> AGE
## C[AGE,OPE]   0.23285485 0.04030234   5.777700  7.572854e-09 OPE <--> AGE
## C[CON,EXT]   0.38058118 0.03501044  10.870507  1.593130e-27 EXT <--> CON
## C[CON,NEU]   0.25969556 0.03567111   7.280277  3.331352e-13 NEU <--> CON
## C[CON,OPE]  -0.28622586 0.04021418  -7.117536  1.098734e-12 OPE <--> CON
## C[EXT,NEU]   0.27605059 0.03454400   7.991276  1.335494e-15 NEU <--> EXT
## C[EXT,OPE]  -0.34443896 0.03822892  -9.009905  2.062345e-19 OPE <--> EXT
## C[NEU,OPE]  -0.09901255 0.03959699  -2.500507  1.240156e-02 OPE <--> NEU
## V[A1]        1.67705932 0.07357444  22.794048 5.252890e-115 A1 <--> A1  
## V[A2]        0.76354863 0.03963774  19.263174  1.094593e-82 A2 <--> A2  
## V[A3]        0.70576489 0.04510292  15.647875  3.435002e-55 A3 <--> A3  
## V[A4]        1.47810259 0.06883999  21.471570 2.871544e-102 A4 <--> A4  
## V[A5]        0.84511989 0.04553868  18.558285  6.990034e-77 A5 <--> A5  
## V[C1]        1.04294084 0.05130818  20.326989  7.421893e-92 C1 <--> C1  
## V[C2]        1.04727939 0.05588028  18.741486  2.271949e-78 C2 <--> C2  
## V[C3]        1.14007575 0.05541605  20.573025  4.788395e-94 C3 <--> C3  
## V[C4]        0.97670448 0.05854026  16.684321  1.704330e-62 C4 <--> C4  
## V[C5]        1.63858678 0.08658845  18.923849  7.255663e-80 C5 <--> C5  
## V[E1]        1.75863096 0.08303725  21.178820  1.497351e-99 E1 <--> E1  
## V[E2]        1.28369574 0.07018375  18.290498  9.851190e-75 E2 <--> E2  
## V[E3]        1.07543181 0.05412632  19.868927  7.560203e-88 E3 <--> E3  
## V[E4]        1.01968201 0.05734980  17.780043  1.009021e-70 E4 <--> E4  
## V[E5]        1.21546936 0.05738608  21.180563  1.442987e-99 E5 <--> E5  
## V[N1]        0.77202708 0.05416344  14.253656  4.254140e-46 N1 <--> N1  
## V[N2]        0.88100921 0.05399874  16.315365  7.674585e-60 N2 <--> N2  
## V[N3]        1.25987596 0.06631127  18.999423  1.724296e-80 N3 <--> N3  
## V[N4]        1.63271952 0.07642767  21.362938 2.956188e-101 N4 <--> N4  
## V[N5]        2.04833227 0.09152194  22.380779 6.057690e-111 N5 <--> N5  
## V[O1]        0.82913181 0.04602848  18.013452  1.527947e-72 O1 <--> O1  
## V[O2]        1.92399378 0.09536054  20.175996  1.591273e-90 O2 <--> O2  
## V[O3]        0.77711574 0.05441673  14.280824  2.881740e-46 O3 <--> O3  
## V[O4]        1.41350151 0.06190555  22.833195 2.146796e-115 O4 <--> O4  
## V[O5]        1.33126454 0.06832159  19.485268  1.464112e-84 O5 <--> O5  
## 
##  Iterations =  26
# Run a CFA using the EFA syntax you created earlier
EFA_CFA <- sem(EFA_syn, data = bfi_CFA)
## Warning in sem.semmod(EFA_syn, data = bfi_CFA): -289 observations removed
## due to missingness
# Locate the BIC in the fit statistics of the summary output
summary(EFA_CFA)$BIC
## [1] 480.1274
# Compare EFA_CFA BIC to the BIC from the CFA based on theory
summary(theory_CFA)$BIC
## [1] 146.5663
# Reset to baseline
options(fit.indices = origFit)

Chapter 4 - Refining Your Measure and Model

EFA vs CFA Revisited:

  • EFA is exploratory and looks at many possible relationships
  • CFA is confirmatory and based only on the loadings defined by a theoretical relationship; it will have different loadings than the EFA due to the different number of variables
  • Due to the rotations involved in EFA, the factors may have non-intuitive names (e.g., MR1, MR2)
    • EFA_scores <- EFA_model$scores
    • CFA_scores <- fscores(EFA_CFA, data = bfi_EFA)
    • plot(density(EFA_scores[,1], na.rm = TRUE), xlim = c(-3, 3), ylim = c(0, 1), col = "blue")
    • lines(density(CFA_scores[,1], na.rm = TRUE), xlim = c(-3, 3), ylim = c(0, 1), col = "red")

Adding Loadings to Improve Fit:

  • Poor model fits are sometimes due to excluded loadings
  • Can alter the syntax to add the desired new loadings; it is OK for some items to be theorized to load onto multiple factors
  • Can then run the updated model and look at the ANOVA
    • anova(theory_CFA, theory_CFA_add)

Improving Fit by Removing Loadings:

  • Can delete loadings rather than add loadings
    • theory_syn_del <- "
    • AGE: A1, A2, A3, A4, A5
    • CON: C1, C2, C3, C4, C5
    • EXT: E1, E2, E3, E4, E5
    • NEU: N1, N2, N3, N4, N5
    • OPE: O1, O2, O3, O5
    • "
    • theory_syn3 <- cfa(text = theory_syn_del, reference.indicators = FALSE)
    • theory_CFA_del <- sem(model = theory_syn3, data = bfi_CFA)

Wrap-Up:

  • Unidimensional and multidimensional models
  • EFA (exploratory) and CFA (confirmatory)
  • The psych and sem packages support these analyses

Example code includes:

# CAUTION THAT THIS WILL SET GLOBAL OPTIONS
# Set the options to include various fit indices so they will print
origFit <- getOption("fit.indices")
options(fit.indices = c("CFI", "GFI", "RMSEA", "BIC"))

# View the first five rows of the EFA loadings
EFA_model$loadings[1:5, ]
##             MR2        MR3          MR1        MR5         MR4
## A1  0.194806449 0.08286748 -0.150193662 -0.4264281 -0.01822345
## A2 -0.007392792 0.06744147 -0.009306157  0.6261111  0.04454954
## A3 -0.028215628 0.02767518 -0.105034038  0.6441602  0.02841403
## A4 -0.049129621 0.18695432 -0.091956712  0.4276816 -0.14686861
## A5 -0.123184595 0.01709411 -0.191147915  0.5553174  0.06932404
# View the first five loadings from the CFA estimated from the EFA results
summary(EFA_CFA)$coeff[1:5, ]
##        Estimate  Std Error   z value      Pr(>|z|)            
## F4A1 -0.5184257 0.04528396 -11.44833  2.397187e-30 A1 <--- MR5
## F4A2  0.7768131 0.03524249  22.04195 1.141266e-107 A2 <--- MR5
## F4A3  0.9968771 0.03885030  25.65944 3.317365e-145 A3 <--- MR5
## F4A4  0.7235088 0.04521136  16.00281  1.221362e-57 A4 <--- MR5
## F4A5  0.7768296 0.03870116  20.07251  1.283512e-89 A5 <--- MR5
# Extracting factor scores from the EFA model
EFA_scores <- EFA_model$scores

# Calculating factor scores by applying the CFA parameters to the EFA dataset
CFA_scores <- fscores(EFA_CFA, data = bfi_EFA)

# Comparing factor scores from the EFA and CFA results from the bfi_EFA dataset
plot(density(EFA_scores[,1], na.rm = TRUE), 
    xlim = c(-3, 3), ylim = c(0, 1), col = "blue")
lines(density(CFA_scores[,1], na.rm = TRUE), 
    xlim = c(-3, 3), ylim = c(0, 1), col = "red")

# Add some plausible item/factor loadings to the syntax
theory_syn_add <- "
AGE: A1, A2, A3, A4, A5
CON: C1, C2, C3, C4, C5
EXT: E1, E2, E3, E4, E5, N4
NEU: N1, N2, N3, N4, N5, E3
OPE: O1, O2, O3, O4, O5
"

# Convert your equations to sem-compatible syntax
theory_syn2 <- cfa(text = theory_syn_add, reference.indicators = FALSE)
## NOTE: adding 25 variances to the model
# Run a CFA with the revised syntax
theory_CFA_add <- sem(model = theory_syn2, data = bfi_CFA)
## Warning in sem.semmod(model = theory_syn2, data = bfi_CFA): -289
## observations removed due to missingness
## Warning in sem.semmod(model = theory_syn2, data = bfi_CFA): The following observed variables are in the input covariance or raw-moment matrix but do not appear in the model:
## gender, education, age
# Conduct a likelihood ratio test
anova(theory_CFA, theory_CFA_add)
## LR Test for Difference Between Models
## 
##                Model Df Model Chisq Df LR Chisq Pr(>Chisq)    
## theory_CFA          265      2005.0                           
## theory_CFA_add      263      1901.8  2   103.19  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare the comparative fit indices - higher is better!
summary(theory_CFA)$CFI
## [1] 0.7863506
summary(theory_CFA_add)$CFI
## [1] 0.7987748
# Compare the RMSEA values - lower is better!
summary(theory_CFA)$RMSEA
## [1] 0.07691165         NA         NA 0.90000000
summary(theory_CFA_add)$RMSEA
## [1] 0.07492514         NA         NA 0.90000000
# Compare BIC values
summary(theory_CFA)$BIC
## [1] 146.5663
summary(theory_CFA_add)$BIC
## [1] 57.40664
# Remove the weakest factor loading from the syntax
theory_syn_del <- "
AGE: A1, A2, A3, A4, A5
CON: C1, C2, C3, C4, C5
EXT: E1, E2, E3, E4, E5
NEU: N1, N2, N3, N4, N5
OPE: O1, O2, O3, O5
"

# Convert your equations to sem-compatible syntax
theory_syn3 <- cfa(text = theory_syn_del, reference.indicators = FALSE)
## NOTE: adding 24 variances to the model
# Run a CFA with the revised syntax
theory_CFA_del <- sem(model = theory_syn3, data = bfi_CFA)
## Warning in sem.semmod(model = theory_syn3, data = bfi_CFA): -289
## observations removed due to missingness
## Warning in sem.semmod(model = theory_syn3, data = bfi_CFA): The following observed variables are in the input covariance or raw-moment matrix but do not appear in the model:
## O4, gender, education, age
# Compare the comparative fit indices - higher is better!
summary(theory_CFA)$CFI
## [1] 0.7863506
summary(theory_CFA_del)$CFI
## [1] 0.7983846
# Compare the RMSEA values - lower is better!
summary(theory_CFA)$RMSEA
## [1] 0.07691165         NA         NA 0.90000000
summary(theory_CFA_del)$RMSEA
## [1] 0.07732379         NA         NA 0.90000000
# Compare BIC values
summary(theory_CFA)$BIC
## [1] 146.5663
summary(theory_CFA_del)$BIC
## [1] 150.9206
# Reset to baseline
options(fit.indices = origFit)

Generalized Linear Models in R

Chapter 1 - GLM - Extension of Regression Toolbox

Limitations of linear models:

  • Linear models are workhorses of data science - explaining variability with linear combinations of variables
    • lm(y ~ x, data = dat)
  • Linear models assume linear relationships, and normally distributed residuals
    • lm(formula = weight ~ Diet, data = ChickWeightEnd) # ChickWeightEnd is the FINAL endpoint
  • Sometimes want to model counts or survival or the like, where the basic linear model is not appropriate
    • Poisson family for count data
    • Binomial family for survival data
    • Link functions to convert the linear model to the relevant family
  • Can run GLM in R
    • glm(y ~ x, data = data, family = "gaussian") # this is the same as lm()

Poisson regression:

  • The Poisson model is good for modeling count data - scores, visitors, cells, etc.
  • The Poisson distribution is always a non-negative integer, and with the same mean and variance
    • glm(goal ~ player, data = scores, family = "poisson") # global intercept and then delta goals for player vs. reference level
    • glm(goal ~ player - 1, data = scores, family = "poisson") # a separate intercept (log of mean goals) for each player
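The equal mean-and-variance property noted above is easy to check by simulation; a quick sketch using base R's rpois():

```r
# Draw a large Poisson sample with rate lambda = 4
set.seed(42)
x <- rpois(n = 1e5, lambda = 4)

# Both the sample mean and the sample variance estimate lambda
mean(x)   # close to 4
var(x)    # also close to 4

# Every draw is a non-negative integer
all(x >= 0 & x == floor(x))   # TRUE
```

When the sample variance is much larger than the mean, the counts are overdispersed and a quasi-Poisson or negative binomial model may be a better fit.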

Basic lm() functions with glm():

  • R gives some useful shortcuts when working with lm() and glm(); for example, automatic output printing
  • The summary() call on a regression model provides additional details about the regression
    • Can further use broom::tidy() to extract key data in a tidy format
    • Can use coef() and confint() to get the coefficients and the confidence intervals
  • Can also make predictions based on an existing model
    • predict(model, newData)

Example code includes:

data(ChickWeight, package="datasets")
ChickWeightEnd <- ChickWeight %>% 
    mutate(Chick=as.factor(as.integer(Chick))) %>%
    group_by(Chick) %>% 
    filter(Time==max(Time), !(Chick %in% c(1, 2, 3, 8, 41))) %>%
    ungroup()
glimpse(ChickWeightEnd)
## Observations: 45
## Variables: 4
## $ weight <dbl> 205, 215, 202, 157, 223, 157, 305, 98, 124, 175, 205, 9...
## $ Time   <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,...
## $ Chick  <fct> 15, 17, 14, 11, 18, 12, 20, 5, 7, 13, 16, 4, 19, 9, 10,...
## $ Diet   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2...
# Fit a lm()
lm(formula = weight ~ Diet, data = ChickWeightEnd)
## 
## Call:
## lm(formula = weight ~ Diet, data = ChickWeightEnd)
## 
## Coefficients:
## (Intercept)        Diet2        Diet3        Diet4  
##      177.75        36.95        92.55        60.81
# Fit a glm()
glm( formula = weight ~ Diet , data = ChickWeightEnd, family = 'gaussian')
## 
## Call:  glm(formula = weight ~ Diet, family = "gaussian", data = ChickWeightEnd)
## 
## Coefficients:
## (Intercept)        Diet2        Diet3        Diet4  
##      177.75        36.95        92.55        60.81  
## 
## Degrees of Freedom: 44 Total (i.e. Null);  41 Residual
## Null Deviance:       225000 
## Residual Deviance: 167800    AIC: 507.8
dat <- data.frame(time=1:30, 
                  count=c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0, 2, 2, 1, 1, 4, 1, 1, 1, 1, 0, 0)
                  )
dat
##    time count
## 1     1     0
## 2     2     0
## 3     3     0
## 4     4     0
## 5     5     1
## 6     6     0
## 7     7     0
## 8     8     1
## 9     9     0
## 10   10     0
## 11   11     2
## 12   12     0
## 13   13     1
## 14   14     0
## 15   15     0
## 16   16     1
## 17   17     0
## 18   18     0
## 19   19     0
## 20   20     2
## 21   21     2
## 22   22     1
## 23   23     1
## 24   24     4
## 25   25     1
## 26   26     1
## 27   27     1
## 28   28     1
## 29   29     0
## 30   30     0
# fit y predicted by x with data.frame dat using the poisson family
poissonOut <- glm(count ~ time, data=dat, family="poisson")

# print the output
print(poissonOut)
## 
## Call:  glm(formula = count ~ time, family = "poisson", data = dat)
## 
## Coefficients:
## (Intercept)         time  
##    -1.43036      0.05815  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
## Null Deviance:       35.63 
## Residual Deviance: 30.92     AIC: 66.02
# Fit a glm with count predicted by time using data.frame dat and gaussian family
lmOut <- glm(count ~ time, data=dat, family="gaussian")

summary(lmOut)
## 
## Call:
## glm(formula = count ~ time, family = "gaussian", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2022  -0.5190  -0.1497   0.2595   3.0194  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.09425    0.32891   0.287   0.7766  
## time         0.03693    0.01853   1.993   0.0561 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7714815)
## 
##     Null deviance: 24.667  on 29  degrees of freedom
## Residual deviance: 21.601  on 28  degrees of freedom
## AIC: 81.283
## 
## Number of Fisher Scoring iterations: 2
summary(poissonOut)
## 
## Call:
## glm(formula = count ~ time, family = "poisson", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6547  -0.9666  -0.7226   0.3830   2.3022  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.43036    0.59004  -2.424   0.0153 *
## time         0.05815    0.02779   2.093   0.0364 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 35.627  on 29  degrees of freedom
## Residual deviance: 30.918  on 28  degrees of freedom
## AIC: 66.024
## 
## Number of Fisher Scoring iterations: 5
scores <- data.frame(player=rep(c("Sam", "Lou"), each=5), 
                     goal=c(1, 2, 0, 4, 3, 0, 0, 1, 0, 0)
                     )
scores
##    player goal
## 1     Sam    1
## 2     Sam    2
## 3     Sam    0
## 4     Sam    4
## 5     Sam    3
## 6     Lou    0
## 7     Lou    0
## 8     Lou    1
## 9     Lou    0
## 10    Lou    0
# Fit a glm() that estimates the difference between players
summary(glm(goal ~ player, data=scores, family="poisson"))
## 
## Call:
## glm(formula = goal ~ player, family = "poisson", data = scores)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0000  -0.6325  -0.6325   0.4934   1.2724  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.6094     0.9999  -1.610   0.1075  
## playerSam     2.3026     1.0487   2.196   0.0281 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18.3578  on 9  degrees of freedom
## Residual deviance:  9.8105  on 8  degrees of freedom
## AIC: 26.682
## 
## Number of Fisher Scoring iterations: 5
# Fit a glm() that estimates an intercept for each player 
summary(glm(goal ~ player - 1, data=scores, family="poisson"))
## 
## Call:
## glm(formula = goal ~ player - 1, family = "poisson", data = scores)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0000  -0.6325  -0.6325   0.4934   1.2724  
## 
## Coefficients:
##           Estimate Std. Error z value Pr(>|z|)  
## playerLou  -1.6094     0.9999  -1.610   0.1075  
## playerSam   0.6931     0.3162   2.192   0.0284 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18.4546  on 10  degrees of freedom
## Residual deviance:  9.8105  on  8  degrees of freedom
## AIC: 26.682
## 
## Number of Fisher Scoring iterations: 5
dat2 <- data.frame(Date=as.Date("2005-01-09")+1:4368, Number=0L) %>%
    mutate(Month=as.factor(lubridate::month(Date)))

eq1 <- c(1, 2, 6, 22, 42, 47, 48, 86, 96, 109, 113, 119, 190, 192, 208, 248, 264, 278, 306, 333, 334, 336, 368, 375, 392, 393, 408, 417, 424, 429, 439, 449, 455, 456, 500, 523, 536, 544, 545, 548, 550, 551, 586, 590, 597, 598, 673, 678, 700, 717, 740, 750, 755, 756, 767, 775, 793, 831, 859, 865, 866, 877, 885, 887, 895, 937, 1086, 1101, 1107, 1111, 1112, 1154, 1157, 1183, 1213, 1235, 1247, 1251, 1269, 1272, 1288, 1295, 1300, 1320, 1342, 1350, 1424, 1454, 1457, 1460, 1476, 1522, 1589, 1598, 1608, 1627, 1642, 1665, 1697, 1709, 1733, 1746, 1749, 1766, 1799, 1830, 1866, 1895, 1914, 1920, 1934, 1942, 1953, 1960, 1961, 1966, 1969, 1989, 2007, 2041, 2051, 2087, 2092, 2096, 2106, 2122, 2129, 2138, 2156, 2159, 2174, 2176, 2177, 2180, 2191, 2214, 2217, 2218, 2251, 2276, 2286, 2302, 2308, 2340, 2352, 2361, 2382, 2416, 2419, 2421, 2464, 2468, 2492, 2522, 2526, 2548, 2550, 2573, 2620, 2625, 2627, 2629, 2698, 2706, 2721, 2726, 2760, 2768, 2787, 2796, 2813, 2854, 2858, 2890, 2900, 2909, 2932, 2933, 2955, 2960, 2966, 2997, 3032, 3057, 3063, 3080, 3090, 3095, 3098, 3122, 3130, 3154, 3160, 3199, 3205, 3215, 3227, 3229, 3243, 3244, 3254, 3302, 3340, 3350, 3469, 3506, 3519, 3525, 3535, 3542, 3584, 3604, 3653, 3660, 3673, 3692, 3694, 3706, 3763, 3792, 3801, 3808, 3812, 3814, 3822, 3884, 3892, 4001, 4084, 4194, 4210, 4220, 4229, 4242, 4265, 4267, 4296, 4302, 4325, 4334, 4338, 4341, 4353, 4354, 4357, 4368)
eq2 <- c(21, 195, 308, 505, 522, 560, 913, 1202, 1353, 1439, 1473, 1484, 1614, 1717, 1808, 1940, 2110, 2391, 2407, 2535, 2716, 2748, 2949, 3313, 3421, 3671, 3967, 3991, 4281)
eq3 <- c(624, 776, 1364, 1585, 2063, 2109, 2196, 2569, 2576, 2607, 3399, 3533, 3607)
eq4 <- c(463, 1918, 2417, 3064, 3606)
eq5 <- c(13, 3826)
eq6 <- c(701, 2097)
eq7 <- c(2509, 4276)
eq9 <- c(1637)

dat2[eq1, "Number"] <- 1L
dat2[eq2, "Number"] <- 2L
dat2[eq3, "Number"] <- 3L
dat2[eq4, "Number"] <- 4L
dat2[eq5, "Number"] <- 5L
dat2[eq6, "Number"] <- 6L
dat2[eq7, "Number"] <- 7L
dat2[eq9, "Number"] <- 9L

str(dat2)
## 'data.frame':    4368 obs. of  3 variables:
##  $ Date  : Date, format: "2005-01-10" "2005-01-11" ...
##  $ Number: int  1 1 0 0 0 1 0 0 0 0 ...
##  $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
table(dat2$Number)
## 
##    0    1    2    3    4    5    6    7    9 
## 4068  246   29   13    5    2    2    2    1
table(dat2$Month)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 363 339 372 360 372 360 372 372 360 372 360 366
# build your models
lmOut <- lm(Number ~ Month, data=dat2) 
poissonOut <- glm(Number ~ Month, data=dat2, family="poisson")

# examine the outputs using print
print(lmOut)
## 
## Call:
## lm(formula = Number ~ Month, data = dat2)
## 
## Coefficients:
## (Intercept)       Month2       Month3       Month4       Month5  
##    0.129477    -0.038031    -0.078401    -0.057254    -0.032702  
##      Month6       Month7       Month8       Month9      Month10  
##   -0.043365    -0.005821    -0.051520    -0.023921    -0.054208  
##     Month11      Month12  
##   -0.023921    -0.022919
print(poissonOut)
## 
## Call:  glm(formula = Number ~ Month, family = "poisson", data = dat2)
## 
## Coefficients:
## (Intercept)       Month2       Month3       Month4       Month5  
##     -2.0443      -0.3478      -0.9302      -0.5838      -0.2911  
##      Month6       Month7       Month8       Month9      Month10  
##     -0.4079      -0.0460      -0.5073      -0.2043      -0.5424  
##     Month11      Month12  
##     -0.2043      -0.1948  
## 
## Degrees of Freedom: 4367 Total (i.e. Null);  4356 Residual
## Null Deviance:       2325 
## Residual Deviance: 2303  AIC: 2976
# examine the outputs using summary
summary(lmOut)
## 
## Call:
## lm(formula = Number ~ Month, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1295 -0.1056 -0.0914 -0.0753  8.8763 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.129477   0.022770   5.686 1.38e-08 ***
## Month2      -0.038031   0.032767  -1.161   0.2458    
## Month3      -0.078401   0.032007  -2.450   0.0143 *  
## Month4      -0.057254   0.032269  -1.774   0.0761 .  
## Month5      -0.032702   0.032007  -1.022   0.3070    
## Month6      -0.043365   0.032269  -1.344   0.1791    
## Month7      -0.005821   0.032007  -0.182   0.8557    
## Month8      -0.051520   0.032007  -1.610   0.1075    
## Month9      -0.023921   0.032269  -0.741   0.4586    
## Month10     -0.054208   0.032007  -1.694   0.0904 .  
## Month11     -0.023921   0.032269  -0.741   0.4586    
## Month12     -0.022919   0.032136  -0.713   0.4758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4338 on 4356 degrees of freedom
## Multiple R-squared:  0.00249,    Adjusted R-squared:  -2.927e-05 
## F-statistic: 0.9884 on 11 and 4356 DF,  p-value: 0.4542
summary(poissonOut)
## 
## Call:
## glm(formula = Number ~ Month, family = "poisson", data = dat2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5089  -0.4595  -0.4277  -0.3880   7.7086  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0443     0.1459 -14.015  < 2e-16 ***
## Month2       -0.3478     0.2314  -1.503 0.132839    
## Month3       -0.9302     0.2719  -3.422 0.000623 ***
## Month4       -0.5837     0.2444  -2.388 0.016923 *  
## Month5       -0.2911     0.2215  -1.314 0.188706    
## Month6       -0.4079     0.2314  -1.763 0.077939 .  
## Month7       -0.0460     0.2074  -0.222 0.824486    
## Month8       -0.5073     0.2361  -2.149 0.031671 *  
## Month9       -0.2043     0.2182  -0.936 0.349112    
## Month10      -0.5424     0.2387  -2.272 0.023075 *  
## Month11      -0.2043     0.2182  -0.936 0.349112    
## Month12      -0.1948     0.2166  -0.899 0.368434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2325.3  on 4367  degrees of freedom
## Residual deviance: 2302.7  on 4356  degrees of freedom
## AIC: 2975.6
## 
## Number of Fisher Scoring iterations: 6
# examine the outputs using tidy
broom::tidy(lmOut)
## # A tibble: 12 x 5
##    term        estimate std.error statistic      p.value
##    <chr>          <dbl>     <dbl>     <dbl>        <dbl>
##  1 (Intercept)  0.129      0.0228     5.69  0.0000000138
##  2 Month2      -0.0380     0.0328    -1.16  0.246       
##  3 Month3      -0.0784     0.0320    -2.45  0.0143      
##  4 Month4      -0.0573     0.0323    -1.77  0.0761      
##  5 Month5      -0.0327     0.0320    -1.02  0.307       
##  6 Month6      -0.0434     0.0323    -1.34  0.179       
##  7 Month7      -0.00582    0.0320    -0.182 0.856       
##  8 Month8      -0.0515     0.0320    -1.61  0.108       
##  9 Month9      -0.0239     0.0323    -0.741 0.459       
## 10 Month10     -0.0542     0.0320    -1.69  0.0904      
## 11 Month11     -0.0239     0.0323    -0.741 0.459       
## 12 Month12     -0.0229     0.0321    -0.713 0.476
broom::tidy(poissonOut)
## # A tibble: 12 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  -2.04       0.146   -14.0   1.27e-44
##  2 Month2       -0.348      0.231    -1.50  1.33e- 1
##  3 Month3       -0.930      0.272    -3.42  6.23e- 4
##  4 Month4       -0.584      0.244    -2.39  1.69e- 2
##  5 Month5       -0.291      0.221    -1.31  1.89e- 1
##  6 Month6       -0.408      0.231    -1.76  7.79e- 2
##  7 Month7       -0.0460     0.207    -0.222 8.24e- 1
##  8 Month8       -0.507      0.236    -2.15  3.17e- 2
##  9 Month9       -0.204      0.218    -0.936 3.49e- 1
## 10 Month10      -0.542      0.239    -2.27  2.31e- 2
## 11 Month11      -0.204      0.218    -0.936 3.49e- 1
## 12 Month12      -0.195      0.217    -0.899 3.68e- 1
# Extract the regression coefficients
coef(poissonOut)
## (Intercept)      Month2      Month3      Month4      Month5      Month6 
## -2.04425523 -0.34775767 -0.93019964 -0.58375226 -0.29111968 -0.40786159 
##      Month7      Month8      Month9     Month10     Month11     Month12 
## -0.04599723 -0.50734279 -0.20426264 -0.54243411 -0.20426264 -0.19481645
# Extract the confidence intervals
confint(poissonOut)
## Waiting for profiling to be done...
##                  2.5 %      97.5 %
## (Intercept) -2.3444432 -1.77136313
## Month2      -0.8103027  0.10063404
## Month3      -1.4866061 -0.41424128
## Month4      -1.0762364 -0.11342457
## Month5      -0.7311289  0.14051326
## Month6      -0.8704066  0.04053012
## Month7      -0.4542037  0.36161360
## Month8      -0.9807831 -0.05092540
## Month9      -0.6367321  0.22171492
## Month10     -1.0218277 -0.08165226
## Month11     -0.6367321  0.22171492
## Month12     -0.6237730  0.22851779
# use the model to predict with new data 
newDat <- data.frame(Month=as.factor(6:8))
predOut <- predict(object = poissonOut, newdata = newDat, type = "response")

# print the predictions
print(predOut)
##          1          2          3 
## 0.08611111 0.12365591 0.07795699

Chapter 2 - Logistic Regression

Overview of logistic regression:

  • Commonly used for making win/loss or survive/die predictions - binary data such as 0/1, Coke/Pepsi, W/L, etc.
  • Logistic regression is the default for GLM with family "binomial"
    • The logit link transforms probabilities to log-odds, while the inverse logit transforms log-odds to probabilities
  • Can fit the logistic regression in R using the default link in glm()
    • glm(y ~ x, data = dat, family = "binomial")
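In base R the logit and inverse-logit transforms mentioned above are qlogis() and plogis(); a minimal sketch:

```r
# Logit: probability -> log-odds
p <- 0.8
log_odds <- qlogis(p)   # log(0.8 / 0.2) = log(4)

# Inverse logit: log-odds -> probability
plogis(log_odds)        # recovers 0.8

# A log-odds of 0 corresponds to a probability of 0.5
qlogis(0.5)             # 0
```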

Bernoulli vs. Binomial Distribution:

  • Binomial and Bernoulli distributions are the foundation of logistic regression
    • Bernoulli models a single event (for example, the likelihood of heads in 1 coin flip)
    • Binomial models multiple events at the same time (for example, the number of heads in 10 coin flips)
  • Several options for entering data in R for use in logistic regression
    • Long format (Bernoulli) - vector of outcomes
    • Wide format (binomial) - proportions of success with weights, such as looking at groups
  • The appropriate input structure depends on the underlying data - groups vs. individuals

Link functions - probit compared with logit:

  • Link functions are important for understanding and simulating GLMs
  • The probit link function is another option (rather than logit) for the link function of the binomial
    • The "probit" is an abbreviation for probability unit, and was computationally easier than the logit back when computers were slower
    • The probit link returns a z-score, and has thinner tails than the logit (so the logit is often better for modeling outliers)
  • Need to convert z-scores to probabilities, then can run the rbinom function
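The tail difference between the two links is visible by evaluating the inverse links far from zero, using base R's plogis() (logit) and pnorm() (probit); a small sketch:

```r
# Inverse-link values well out in the left tail
z <- -4
plogis(z)   # logistic CDF: roughly 0.018
pnorm(z)    # normal CDF: roughly 0.00003

# The logistic tail is heavier, so a logit model assigns more
# probability to extreme outcomes than a probit model does
plogis(z) > pnorm(z)   # TRUE
```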

Example code includes:

busData <- readr::read_csv("./RInputFiles/busData.csv")
## Parsed with column specification:
## cols(
##   CommuteDays = col_double(),
##   MilesOneWay = col_double(),
##   Bus = col_character()
## )
bus <- busData %>%
    mutate(Bus=factor(Bus, levels=c("No", "Yes")))
glimpse(bus)
## Observations: 15,892
## Variables: 3
## $ CommuteDays <dbl> 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ MilesOneWay <dbl> 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, ...
## $ Bus         <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, ...
# Build a glm that models Bus predicted by CommuteDays
# using data.frame bus. Remember to use a binomial family.
busOut <- glm(Bus ~ CommuteDays, data=bus, family="binomial")

# Print the busOut (be sure to use the print() function)
print(busOut)
## 
## Call:  glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)
## 
## Coefficients:
## (Intercept)  CommuteDays  
##     -1.4549       0.1299  
## 
## Degrees of Freedom: 15891 Total (i.e. Null);  15890 Residual
## Null Deviance:       19570 
## Residual Deviance: 19540     AIC: 19540
# Look at the summary() of busOut
summary(busOut)
## 
## Call:
## glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9560  -0.8595  -0.8595   1.5330   1.7668  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.45493    0.11471 -12.683  < 2e-16 ***
## CommuteDays  0.12985    0.02312   5.616 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19536  on 15890  degrees of freedom
## AIC: 19540
## 
## Number of Fisher Scoring iterations: 4
# Look at the tidy() output of busOut
broom::tidy(busOut)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.45     0.115     -12.7  7.32e-37
## 2 CommuteDays    0.130    0.0231      5.62 1.96e- 8
# Simulate 1 draw with a sample size of 100
binomialSim <- rbinom(n=1, size=100, prob=0.5)

# Simulate 100 draws with a sample size of 1 
BernoulliSim <- rbinom(n=100, size=1, prob=0.5)

# Print the results from the binomial
print(binomialSim)
## [1] 47
# Sum the results from the Bernoulli
sum(BernoulliSim)
## [1] 46
dataLong <- data.frame(x=factor(rep(c("a", "b"), each=14), levels=c("a", "b")), 
                       y=factor(c('fail', 'fail', 'fail', 'fail', 'success', 'fail', 'fail', 'fail', 'fail', 'fail', 'fail', 'fail', 'fail', 'success', 'success', 'fail', 'success', 'success', 'success', 'success', 'success', 'success', 'success', 'success', 'success', 'fail', 'success', 'fail'), levels=c("fail", "success"))
                       )
str(dataLong)
## 'data.frame':    28 obs. of  2 variables:
##  $ x: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
##  $ y: Factor w/ 2 levels "fail","success": 1 1 1 1 2 1 1 1 1 1 ...
# Fit a long format logistic regression
lr_1 <- glm(y ~ x, data=dataLong, family="binomial")
print(lr_1)
## 
## Call:  glm(formula = y ~ x, family = "binomial", data = dataLong)
## 
## Coefficients:
## (Intercept)           xb  
##      -1.792        3.091  
## 
## Degrees of Freedom: 27 Total (i.e. Null);  26 Residual
## Null Deviance:       38.67 
## Residual Deviance: 26.03     AIC: 30.03
dataWide <- dataLong %>%
    group_by(x) %>%
    summarize(fail=sum(y=="fail"), success=sum(y=="success"), Total=n(), successProportion = success/Total)
dataWide
## # A tibble: 2 x 5
##   x      fail success Total successProportion
##   <fct> <int>   <int> <int>             <dbl>
## 1 a        12       2    14             0.143
## 2 b         3      11    14             0.786
# Fit a wide format logistic regression
# (note: glm() treats the first column of cbind() as the "success" count,
#  so with cbind(fail, success) the coefficient signs flip vs. lr_1)
lr_2 <- glm(cbind(fail, success) ~ x, data=dataWide, family="binomial")

# Fit a weighted form logistic regression
lr_3 <- glm(successProportion ~ x, weights=Total, data=dataWide, family="binomial")

# print your results
print(lr_2)
## 
## Call:  glm(formula = cbind(fail, success) ~ x, family = "binomial", 
##     data = dataWide)
## 
## Coefficients:
## (Intercept)           xb  
##       1.792       -3.091  
## 
## Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
## Null Deviance:       12.64 
## Residual Deviance: -4.441e-16    AIC: 9.215
print(lr_3)
## 
## Call:  glm(formula = successProportion ~ x, family = "binomial", data = dataWide, 
##     weights = Total)
## 
## Coefficients:
## (Intercept)           xb  
##      -1.792        3.091  
## 
## Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
## Null Deviance:       12.64 
## Residual Deviance: 4.441e-15     AIC: 9.215
# Fit a GLM with a logit link and save it as busLogit
busLogit <- glm(Bus ~ CommuteDays, data = bus, family = binomial(link = "logit"))

# Fit a GLM with probit link and save it as busProbit
busProbit <- glm(Bus ~ CommuteDays, data = bus, family = binomial(link = "probit"))

# Print model summaries
summary(busLogit)
## 
## Call:
## glm(formula = Bus ~ CommuteDays, family = binomial(link = "logit"), 
##     data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9560  -0.8595  -0.8595   1.5330   1.7668  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.45493    0.11471 -12.683  < 2e-16 ***
## CommuteDays  0.12985    0.02312   5.616 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19536  on 15890  degrees of freedom
## AIC: 19540
## 
## Number of Fisher Scoring iterations: 4
summary(busProbit)
## 
## Call:
## glm(formula = Bus ~ CommuteDays, family = binomial(link = "probit"), 
##     data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9545  -0.8596  -0.8596   1.5328   1.7706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.88951    0.06833 -13.017  < 2e-16 ***
## CommuteDays  0.07810    0.01380   5.658 1.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19536  on 15890  degrees of freedom
## AIC: 19540
## 
## Number of Fisher Scoring iterations: 4
# Convert from the logit scale to a probability
# (note: dlogis(0) is the logistic *density* at 0, i.e. 0.25;
#  the inverse-logit CDF would be plogis(0) = 0.5)
p <- dlogis(0)

# Simulate a logit 
rbinom(n=10, size=1, prob=p)
##  [1] 0 0 1 0 0 0 0 0 0 0
# Convert from the probit scale to a probability
p <- pnorm(0)

# Simulate a probit
rbinom(n=10, size=1, prob=p)
##  [1] 0 0 0 1 0 0 1 0 1 1

Chapter 3 - Interpreting and Visualizing GLMs

Poisson Regression Coefficients:

  • Linear models are additive, but we may want to link the linear model to an exponential (multiplicative) scale
  • The Poisson model is multiplicative while the linear model is additive
    • poissonOut <- glm(y ~ x, family = "poisson")
    • tidy(poissonOut, exponentiate = TRUE) # exponentiate the coefficients
  • After exponentiation, a significant Poisson coefficient is statistically different from 1 (since exp(0) = 1)

Plotting Poisson Regression:

  • Can use either a geom_smooth() or a boxplot
  • Example of using simulated dose data on cancer cells
    • ggplot(data = dat, aes(x = dose, y = cells)) + geom_point()
    • ggplot(data = dat, aes(x = dose, y = cells)) + geom_jitter(width = 0.05, height = 0.05)
    • ggplot(data = dat, aes(x = dose, y = cells)) + geom_jitter(width = 0.05, height = 0.05) + geom_smooth()
    • ggplot(data = dat, aes(x = dose, y = cells)) + geom_jitter(width = 0.05, height = 0.05) + geom_smooth(method = "glm", method.args = list(family = "poisson")) # Poisson GLM

Understanding output from logistic regression:

  • Linear model results are the easiest to communicate
  • Poisson model results are multiplicative rather than additive, and are relatively easy to communicate
  • Logistic model results are in log-odds form, which are harder to communicate
    • The odds are p(win) / p(loss)
    • The log-odds are ln( p/(1-p) ); this transformation is the logit function
    • The odds are exp(log-odds)
    • The odds ratio for a predictor is exp(beta_1) - if the odds ratio is 1, there is no impact (odds ratios greater than 1 mean a greater chance of something occurring)
  • Can extract confidence intervals and coefficients from the binomial GLM
    • glmOut <- glm(y ~ x, family = "binomial")
    • coef(glmOut)
    • exp(coef(glmOut))
    • confint(glmOut)
    • exp(confint(glmOut))
    • tidy(glmOut, exponentiate = TRUE, conf.int = TRUE) # get everything at once from broom::tidy()
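The probability / odds / log-odds chain above can be verified directly in base R; a minimal sketch:

```r
# Start from a win probability
p <- 0.75
odds <- p / (1 - p)     # 3: a win is three times as likely as a loss
log_odds <- log(odds)   # the logit scale that glm() coefficients live on

# exp() undoes the log; plogis() maps back to the probability
exp(log_odds)           # 3
plogis(log_odds)        # 0.75
```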

ggplot2 and binomial regression:

  • Can look at plots of the underlying data with ggplot2
    • ggplot(bus, aes(x = MilesOneWay, y = Bus)) + geom_point()
    • ggJitter <- ggplot(bus, aes(x = MilesOneWay, y = Bus)) + geom_jitter(width = 0, height = 0.05)
    • ggJitter + geom_smooth() # does not work!
    • bus$Bus2 <- as.numeric(bus$Bus) - 1 # convert factor to numeric
    • ggJitter + geom_smooth() # still not really right
    • ggJitter + geom_smooth(method = "glm", method.args = list(family = "binomial")) # much better!
  • Can use graphs to compare probit and logit
    • ggJitter + geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit")), se = FALSE, color = "red") + geom_smooth(method = "glm", method.args = list(family = binomial(link = "probit")), se = FALSE, color = "blue")

Example code includes:

# extract the coefficients from lmOut
(lmCoef <- coef(lmOut))
## (Intercept)      Month2      Month3      Month4      Month5      Month6 
##  0.12947658 -0.03803116 -0.07840132 -0.05725436 -0.03270239 -0.04336547 
##      Month7      Month8      Month9     Month10     Month11     Month12 
## -0.00582067 -0.05151959 -0.02392103 -0.05420777 -0.02392103 -0.02291921
# extract the coefficients from poissonOut
(poissonCoef <- coef(poissonOut))
## (Intercept)      Month2      Month3      Month4      Month5      Month6 
## -2.04425523 -0.34775767 -0.93019964 -0.58375226 -0.29111968 -0.40786159 
##      Month7      Month8      Month9     Month10     Month11     Month12 
## -0.04599723 -0.50734279 -0.20426264 -0.54243411 -0.20426264 -0.19481645
# take the exponential using exp()
(poissonCoefExp <- exp(poissonCoef))
## (Intercept)      Month2      Month3      Month4      Month5      Month6 
##   0.1294766   0.7062700   0.3944749   0.5578014   0.7474262   0.6650709 
##      Month7      Month8      Month9     Month10     Month11     Month12 
##   0.9550446   0.6020933   0.8152482   0.5813315   0.8152482   0.8229857
# This is because the Poisson (log-link) coefficients are multiplicative
# Notice that 0.129 * 0.706 = 0.091 from the Poisson coefficients matches 0.129 - 0.038 = 0.091 from the linear model
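The equivalence noted in that comment is just the exponential rule exp(a + b) = exp(a) * exp(b). A minimal check, using the intercept and Month2 values from the poissonOut output above:

```r
# The log link makes Poisson coefficients multiplicative on the response
# scale, because exp(a + b) == exp(a) * exp(b)
b0 <- -2.04425523   # (Intercept) from poissonOut above
b1 <- -0.34775767   # Month2 from poissonOut above

mu_additive       <- exp(b0 + b1)     # exponentiate the summed linear predictor
mu_multiplicative <- exp(b0) * exp(b1)

all.equal(mu_additive, mu_multiplicative)  # TRUE
round(mu_additive, 3)                      # ~0.091, matching the comparison above
```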

cellData <- data.frame(dose=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), 
                       cells=c(1, 0, 0, 0, 0, 2, 0, 1, 2, 0, 3, 0, 2, 2, 1, 0, 1, 2, 2, 2, 2, 3, 5, 3, 0, 3, 6, 2, 4, 4, 2, 2, 8, 4, 4, 4, 7, 2, 6, 5, 2, 5, 8, 4, 7, 4, 4, 7, 9, 3, 6, 7, 9, 5, 3, 5, 5, 3, 4, 11, 2, 7, 9, 3, 4, 2, 6, 5, 5, 6, 4, 5, 8, 10, 11, 9, 8, 8, 11, 7, 10, 12, 9, 12, 10, 12, 9, 17, 6, 9, 15, 11, 11, 10, 4, 9, 13, 8, 8, 13)
                       )

# Use geom_smooth to plot a continuous predictor variable
ggplot(data = cellData, aes(x = dose, y = cells)) + 
    geom_jitter(width = 0.05, height = 0.05) + 
    geom_smooth(method = 'glm', method.args = list(family = 'poisson'))

# Extract out the coefficients 
coefOut <- coef(busOut)

# Convert the coefficients to odds-ratios 
exp(coefOut)
## (Intercept) CommuteDays 
##   0.2334164   1.1386623
# use tidy on busOut and exponentiate the results and extract the confidence interval
broom::tidy(busOut, exponentiate=TRUE, conf.int=TRUE)
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.233    0.115     -12.7  7.32e-37    0.186     0.292
## 2 CommuteDays    1.14     0.0231      5.62 1.96e- 8    1.09      1.19
str(bus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 15892 obs. of  3 variables:
##  $ CommuteDays: num  5 5 5 5 3 4 5 5 5 5 ...
##  $ MilesOneWay: num  19.5 19.5 19.5 19.5 19.5 ...
##  $ Bus        : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 1 1 1 ...
bus <- bus %>%
    mutate(Bus2 = as.integer(Bus)-1)
str(bus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 15892 obs. of  4 variables:
##  $ CommuteDays: num  5 5 5 5 3 4 5 5 5 5 ...
##  $ MilesOneWay: num  19.5 19.5 19.5 19.5 19.5 ...
##  $ Bus        : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 1 1 1 ...
##  $ Bus2       : num  1 1 1 1 1 1 0 0 0 0 ...
# add in the missing parts of the ggplot
ggJitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) + 
    geom_jitter(width = 0, height = 0.05)

# add in geom_smooth()
ggJitter + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# add in the missing parts of the ggplot
ggJitter + geom_smooth(method =  "glm" , method.args = list(family="binomial"))

# add in the missing parts of the ggplot
ggJitter + 
    geom_smooth(method = 'glm', method.args = list(family = binomial(link="probit")), 
                color = 'red', se = FALSE
                ) +
    geom_smooth(method = 'glm', method.args = list(family = binomial(link="logit")), 
                color = 'blue', se = FALSE
                )


Chapter 4 - Multiple Regression with GLMs

Multiple logistic regression:

  • Can use multiple predictors in the logistic regression
    • Risks of over-fitting as the number of predictor variables increases - typical target of observations >= 10*predictors
    • glm(Bus ~ CommuteDays + MilesOneWay, data = bus, family = 'binomial')
  • When there is correlation in the predictors, the coefficients may change depending on the order in which they appear in the equation

Formulas in R:

  • The model.matrix() function is the cornerstone of the regression process, and often runs behind the scenes
  • With a categorical predictor, the default is to absorb the first level into the global intercept as the reference group, with each other level's coefficient estimated relative to that reference

Assumptions of multiple logistic regression:

  • Simpson’s paradox can be a confounder - need to include all the relevant grouping variables
    • Example of the UC Berkeley admission data - key to include the “by department” variable
  • Assumptions of linear and monotonic responses
  • Predictors and response variables should be independent
  • Over-dispersion can cause issues - too many zeroes, too many ones, changing variances over x, etc.
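One common overdispersion check for count GLMs is the ratio of residual deviance to residual degrees of freedom; values well above 1 suggest overdispersion. A hedged sketch with simulated data (not from the course):

```r
# Simulate counts with extra-Poisson variance (negative binomial), then fit a
# Poisson GLM and look at deviance / df as an overdispersion diagnostic.
set.seed(1)
x <- rnorm(300)
y <- rnbinom(300, mu = exp(1 + 0.5 * x), size = 0.8)  # overdispersed counts

poisFit <- glm(y ~ x, family = "poisson")
ratio <- deviance(poisFit) / df.residual(poisFit)
ratio  # well above 1 here, flagging overdispersion
```

When this ratio is large, a quasi-Poisson or negative binomial model is usually a better choice than the plain Poisson.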

Wrap up:

  • GLM extensions of LM - count data, logits, plotting, etc.

Example code includes:

# Build a logistic regression with Bus predicted by CommuteDays and MilesOneWay
busBoth <- glm(Bus ~ CommuteDays + MilesOneWay, data=bus, family="binomial")

# Look at the summary of the output
summary(busBoth)
## 
## Call:
## glm(formula = Bus ~ CommuteDays + MilesOneWay, family = "binomial", 
##     data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0732  -0.9035  -0.7816   1.3968   2.5066  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
## CommuteDays  0.066084   0.023181   2.851  0.00436 ** 
## MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19137  on 15889  degrees of freedom
## AIC: 19143
## 
## Number of Fisher Scoring iterations: 4
# Build a logistic regression with Bus predicted by CommuteDays
busDays <- glm(Bus ~ CommuteDays, data=bus, family="binomial")

# Build a logistic regression with Bus predicted by MilesOneWay
busMiles <- glm(Bus ~ MilesOneWay, data=bus, family="binomial")


# Build a glm with CommuteDays first and MilesOneWay second
busOne <- glm(Bus ~ CommuteDays + MilesOneWay, data=bus, family="binomial")

# Build a glm with MilesOneWay first and CommuteDays second
busTwo <- glm(Bus ~ MilesOneWay + CommuteDays, data=bus, family="binomial")

# Print model summaries
summary(busOne)
## 
## Call:
## glm(formula = Bus ~ CommuteDays + MilesOneWay, family = "binomial", 
##     data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0732  -0.9035  -0.7816   1.3968   2.5066  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
## CommuteDays  0.066084   0.023181   2.851  0.00436 ** 
## MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19137  on 15889  degrees of freedom
## AIC: 19143
## 
## Number of Fisher Scoring iterations: 4
summary(busTwo)
## 
## Call:
## glm(formula = Bus ~ MilesOneWay + CommuteDays, family = "binomial", 
##     data = bus)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0732  -0.9035  -0.7816   1.3968   2.5066  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
## MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
## CommuteDays  0.066084   0.023181   2.851  0.00436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19568  on 15891  degrees of freedom
## Residual deviance: 19137  on 15889  degrees of freedom
## AIC: 19143
## 
## Number of Fisher Scoring iterations: 4
size <- c(1.1, 2.2, 3.3)
count <- c(10, 4, 2)

# use model matrix with size
model.matrix(~ size)
##   (Intercept) size
## 1           1  1.1
## 2           1  2.2
## 3           1  3.3
## attr(,"assign")
## [1] 0 1
# use model matrix with count
model.matrix(~ size + count)
##   (Intercept) size count
## 1           1  1.1    10
## 2           1  2.2     4
## 3           1  3.3     2
## attr(,"assign")
## [1] 0 1 2
color <- c("red", "blue", "green")

# create a matrix that includes a reference intercept
model.matrix(~ color)
##   (Intercept) colorgreen colorred
## 1           1          0        1
## 2           1          0        0
## 3           1          1        0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$color
## [1] "contr.treatment"
# create a matrix that includes an intercept for each group
model.matrix(~ color - 1)
##   colorblue colorgreen colorred
## 1         0          0        1
## 2         1          0        0
## 3         0          1        0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$color
## [1] "contr.treatment"
shape <- c("square", "square", "circle")

# create a matrix that includes color and shape  
model.matrix(~ color + shape - 1)
##   colorblue colorgreen colorred shapesquare
## 1         0          0        1           1
## 2         1          0        0           1
## 3         0          1        0           0
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$color
## [1] "contr.treatment"
## 
## attr(,"contrasts")$shape
## [1] "contr.treatment"
# create a matrix that includes shape and color 
model.matrix(~ shape + color - 1)
##   shapecircle shapesquare colorgreen colorred
## 1           0           1          0        1
## 2           0           1          0        0
## 3           1           0          1        0
## attr(,"assign")
## [1] 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$shape
## [1] "contr.treatment"
## 
## attr(,"contrasts")$color
## [1] "contr.treatment"
data("UCBAdmissions", package="datasets")
UCBdata <- as.data.frame(UCBAdmissions) %>%
    mutate(Gender=factor(Gender, levels=c("Female", "Male")), Dept=factor(Dept, levels=LETTERS[1:6])) %>%
    tidyr::spread(Admit, Freq) %>%
    arrange(Dept, Gender)

# build a binomial glm where Admitted and Rejected are predicted by Gender
glm1 <- glm(cbind(Admitted, Rejected) ~ Gender, data=UCBdata, family="binomial")
summary(glm1)
## 
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender, family = "binomial", 
##     data = UCBdata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -16.7915   -4.7613   -0.4365    5.1025   11.2022  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.83049    0.05077 -16.357   <2e-16 ***
## GenderMale   0.61035    0.06389   9.553   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.06  on 11  degrees of freedom
## Residual deviance: 783.61  on 10  degrees of freedom
## AIC: 856.55
## 
## Number of Fisher Scoring iterations: 4
# build a binomial glm where Admitted and Rejected are predicted by Gender and Dept
glm2 <- glm(cbind(Admitted, Rejected) ~ Gender + Dept, data=UCBdata, family="binomial")
summary(glm2)
## 
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Gender + Dept, family = "binomial", 
##     data = UCBdata)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  3.7189  -1.2487   0.2706  -0.0560  -0.9243   1.2533  -0.0858   0.0826  
##       9       10       11       12  
## -0.8509   1.2205   0.2052  -0.2076  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.68192    0.09911   6.880 5.97e-12 ***
## GenderMale  -0.09987    0.08085  -1.235    0.217    
## DeptB       -0.04340    0.10984  -0.395    0.693    
## DeptC       -1.26260    0.10663 -11.841  < 2e-16 ***
## DeptD       -1.29461    0.10582 -12.234  < 2e-16 ***
## DeptE       -1.73931    0.12611 -13.792  < 2e-16 ***
## DeptF       -3.30648    0.16998 -19.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.056  on 11  degrees of freedom
## Residual deviance:  20.204  on  5  degrees of freedom
## AIC: 103.14
## 
## Number of Fisher Scoring iterations: 4
# Add a non-linear equation to a geom_smooth
ggJitter + 
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'), formula = y~I(x^2), color = 'red')


Introduction to Bioconductor

Chapter 1 - What is Bioconductor?

Introduction to the Bioconductor Project:

  • Bioconductor is open-source software - datasets and packages for analyzing biological data
    • Typically for measuring either the structure or the function (or the interactions of these) for biological elements
  • Bioconductor has its own repository and means of installing packages
  • Bioconductor is constantly in development
    • library(packageName)
    • BiocInstaller::biocVersion()
    • sessionInfo()
    • packageVersion("packageName")
    • BiocInstaller::biocValid()

Role of S4 in Bioconductor:

  • R supports both the S3 and S4 object systems; Bioconductor classes are built on S4
  • The S3 system is simple and powerful, and is typical of CRAN packages
    • S3 often uses generic functions such as plot() and print() which behave differently depending on the object type
  • The S4 system implements OOP by defining objects that are generalized to classes
    • S4 classes have a formal definition and inheritance, making type checking much easier
    • mydescriptor <- new("GenomeDescription")
    • The S4 class is more complex than S3; higher setup costs but easier to share and reuse
  • Can check whether an object is from an S4 class
    • isS4(mydescriptor)
    • str(mydescriptor) # will show a formal class with slots
  • An S4 class has a name, slots (fields), and inheritance (often via "contains")
  • There are S4 accessor methods for getting the slots
    • .S4methods(class = "GenomeDescription")
    • showMethods(classes = "GenomeDescription", where = search())
    • show(mydescriptor) # sort of like str() but for S4 objects

Biology of Genomic Datasets:

  • Organisms are complex and interconnected and can be unicellular or multicellular
  • All organisms have a genome (complete genetic material, stored mostly in the chromosomes) - the “blueprint”
    • TAGC are the building blocks
  • The genome can be thought of as a genetic DNA alphabet
    • Genes contain heredity instructions
    • Coding genes are expressed through proteins - DNA to RNA (transcription) and RNA to protein (translation)
    • Non-coding genes are not expressed
  • Yeast is a single-cell organism that is frequently used in food and beverage creation
    • library(BSgenome.Scerevisiae.UCSC.sacCer3)
    • yeast <- BSgenome.Scerevisiae.UCSC.sacCer3
    • available.genomes()
    • length(yeast)
    • names(yeast)
    • seqlengths(yeast)
    • getSeq(yeast)
    • getSeq(yeast, "chrM") # get a single chromosome
    • getSeq(yeast, end = 10) # first 10 base pairs of each chromosome

Example code includes:

# Load the BiocInstaller package
library(BiocInstaller)

# Explicit syntax to check the Bioconductor version
BiocInstaller::biocVersion() 

# When BiocInstaller is loaded use biocVersion alone
biocVersion()


# Load the BSgenome package
library(BSgenome)

# Check the version of the BSgenome package
packageVersion("BSgenome")

# Investigate the a_genome object using show()
# show(a_genome)

# Investigate some other accessors
# organism(a_genome)
# provider(a_genome)
# seqinfo(a_genome)


# Load the yeast genome
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# Assign data to the yeastGenome object
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3

# Get the head of seqnames and tail of seqlengths for yeastGenome
head(seqnames(yeastGenome))
tail(seqlengths(yeastGenome))

# Select chromosome M, alias chrM
yeastGenome$chrM

# Count characters of the chrM sequence
nchar(yeastGenome$chrM)


# Assign data to the yeastGenome object
yeastGenome <- BSgenome.Scerevisiae.UCSC.sacCer3

# Get the first 30 bases of each chromosome
getSeq(yeastGenome, start=1, end=30)

Chapter 2 - Biostrings and When to Use Them

Introduction to Biostrings:

  • Biostrings has methods for quickly processing biological strings
    • Memory efficient
    • Containers that can be inherited
    • showClass("XString")
    • showClass("BString")
    • showClass("BStringSet")
  • There are bases for DNA and RNA available for use
    • DNA_BASES # DNA 4 bases
    • RNA_BASES # RNA 4 bases
    • AA_STANDARD # 20 Amino acids
    • DNA_ALPHABET # contains IUPAC_CODE_MAP
    • RNA_ALPHABET # contains IUPAC_CODE_MAP
    • AA_ALPHABET # contains AMINO_ACID_CODE
  • The general process for gene expression includes
    • Double-stranded DNA splits and is transcribed to RNA (from the template strand: T -> A, A -> U, C -> G, G -> C)
    • Each set of three RNA bases translates to one amino acid
    • dna_seq <- DNAString("ATGATCTCGTAA")
    • rna_seq <- RNAString(dna_seq)
    • rna_seq # will give seq: AUGAUCUCGUAA
    • aa_seq <- translate(rna_seq) # translation from RNA to AA
    • aa_seq # seq: MIS*
    • translate(dna_seq) # translate() also goes directly from DNA to AA (a shortcut past the explicit RNA step)
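The transcription and translation steps above can be sketched in base R without Biostrings. The codon lookup table below is a hypothetical minimal subset covering only the codons in this toy sequence; Biostrings' RNAString() and translate() use the full genetic code.

```r
# Base-R sketch of DNA -> RNA -> amino acids for the toy sequence above.
dna_seq <- "ATGATCTCGTAA"

# Transcription (coding-strand convention): T is replaced by U
rna_seq <- chartr("T", "U", dna_seq)    # "AUGAUCUCGUAA"

# Translation: split into codons of three bases and look each one up
starts <- seq(1, nchar(rna_seq), by = 3)
codons <- substring(rna_seq, starts, starts + 2)

# Minimal, hypothetical codon table (only the codons used here)
codon_table <- c(AUG = "M", AUC = "I", UCG = "S", UAA = "*")
aa_seq <- paste(codon_table[codons], collapse = "")
aa_seq  # "MIS*", matching the Biostrings result above
```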

Sequence handling:

  • Can use XString to store a single sequence
  • Can use XStringSet to store multiple sequences, each of varying lengths
    • zikaVirus <- readDNAStringSet("data/zika.fa")
    • zikaVirus_seq <- unlist(zikaVirus) # collate the set into a single sequence
    • zikaSet <- DNAStringSet(zikaVirus_seq, start = c(1, 101, 201), end = c(100, 200, 300)) # create a new set from a single sequence
    • complement(a_seq) # the complementary sequence
  • Can use rev() and reverse() to reverse sequences
    • rev(zikaShortSet) # reverses the order of the set, so the last sequence becomes first
    • reverse(zikaShortSet) # reverses each sequence in the set from right to left
    • reverseComplement(rna_seq) # same result as reverse(complement(rna_seq)) but more memory efficient

Why we are interested in patterns:

  • Can learn more about patterns using sequencing - frequency, occurrences, etc.
  • Can use Biostring string matching functions
    • matchPattern(pattern, subject) # 1 string to 1 string
    • vmatchPattern(pattern, subject) # 1 set of strings to 1 string OR 1 string to a set of strings
  • Palindromes can be important in biology - binding sites
    • findPalindromes() # find palindromic regions in a single sequence
  • There are six possible reading frames for translation - three offsets on the forward strand and three on its reverse complement - since each amino acid is read from 3 bases starting wherever the window starts
    • [1] 30 ACATGGGCCTACCATGGGAGCTACGAAGCC # original sequence
    • 6 possible reading frames, DNAStringSet

    • [1] 30 ACATGGGCCTACCATGGGAGCTACGAAGCC + 1
    • [2] 30 GGCTTCGTAGCTCCCATGGTAGGCCCATGT - 1
    • [3] 29 CATGGGCCTACCATGGGAGCTACGAAGCC + 2
    • [4] 29 GCTTCGTAGCTCCCATGGTAGGCCCATGT - 2
    • [5] 28 ATGGGCCTACCATGGGAGCTACGAAGCC + 3
    • [6] 28 CTTCGTAGCTCCCATGGTAGGCCCATGT - 3
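The six frames listed above can be reproduced in base R (Biostrings handles widths and strands properly for real data; this just shows where each frame starts, using the 30-base example sequence from the text).

```r
# Base-R sketch of the six reading frames of a short sequence.
revcomp <- function(s) {
  # complement each base, then reverse the string
  comp <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

seqs <- "ACATGGGCCTACCATGGGAGCTACGAAGCC"
frames <- c(
  sapply(1:3, function(i) substring(seqs, i)),          # + strand, offsets 1-3
  sapply(1:3, function(i) substring(revcomp(seqs), i))  # - strand, offsets 1-3
)
nchar(frames)  # 30 29 28 30 29 28 - the widths shown in the listing above
```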

Example code includes:

# Load packages
library(Biostrings)

# Check the alphabet of the zikaVirus
alphabet(zikaVirus)

# Check the alphabetFrequency of the zikaVirus
alphabetFrequency(zikaVirus)

# Check alphabet of the zikaVirus using baseOnly = TRUE
alphabet(zikaVirus, baseOnly = TRUE)


# Unlist the set and select the first 21 letters as dna_seq, then print it
dna_seq <- subseq(unlist(zikaVirus), end = 21)
dna_seq

# 1.1 Transcribe dna_seq as rna_seq, then print it
rna_seq <- RNAString(dna_seq) 
rna_seq

# 1.2 Translate rna_seq as aa_seq, then print it
aa_seq <- translate(rna_seq)
aa_seq

# 2.1 Translate dna_seq as aa_seq_2, then print it
aa_seq_2 <- translate(dna_seq)
aa_seq_2


# Create zikv with one collated sequence using `zikaVirus`
zikv <- unlist(zikaVirus)

# Check the length of zikaVirus and zikv
length(zikaVirus)
length(zikv)

# Check the width of zikaVirus
width(zikaVirus)

# Subset zikv to only the first 30 bases
subZikv <- subseq(zikv, end = 30)
subZikv


# The reverse of zikv is
reverse(zikv)

# The complement of zikv is
complement(zikv)

# The reverse complement of zikv is
reverseComplement(zikv)

# The translation of zikv is
translate(zikv)


# Find palindromes in zikv
findPalindromes(zikv)


# print the rnaframesZikaSet 
rnaframesZikaSet

# translate all 6 reading frames 
AAzika6F <- translate(rnaframesZikaSet)
AAzika6F

# Count the matches allowing 15 mismatches
vcountPattern(pattern = ns5, subject = AAzika6F, max.mismatch = 15)

# Select the frame that contains the match
selectedSet <- AAzika6F[3]

#Convert this frame into a single sequence
selectedSeq <- unlist(selectedSet)


# Use vmatchPattern with the set
vmatchPattern(pattern = ns5, subject = selectedSet, max.mismatch = 15)

# Use matchPattern with the single sequence
matchPattern(pattern = ns5, subject = selectedSeq, max.mismatch = 15)

Chapter 3 - IRanges and GenomicRanges

IRanges and Genomic Structures:

  • Sequencing millions of bases is now cheap, so there is a need to analyze large sequence datasets
  • Sequence ranges are a core component of the analysis
    • library(IRanges)
    • myIRanges <- IRanges(start = 20, end = 30)
    • (myIRanges_width <- IRanges(start = c(1, 20), width = c(30, 11)))
    • (myIRanges_end <- IRanges(start = c(1, 20), end = 30))
    • Note that width = end - start + 1
  • Can also use RLE - run length encoding
    • General S4 containers for saving large and repetitive vectors
    • (some_numbers <- c(3, 2, 2, 2, 3, 3, 4, 2))
    • (Rle(some_numbers)) # numeric-Rle of length 8 with 5 runs
  • Can also create using logical vectors for keep or skip
    • IRanges(start = c(FALSE, FALSE, TRUE, TRUE)) # will pull items 3 and 4 for a range of width 2
    • logi <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
    • myRle <- Rle(logi)
  • The IRanges class is hierarchical and can hold metadata
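Base R's rle() illustrates the same run-length-encoding idea behind the S4 Rle container: store each value once together with how many times it repeats in a row.

```r
# Run-length encoding in base R: 8 values compress to 5 runs.
some_numbers <- c(3, 2, 2, 2, 3, 3, 4, 2)
enc <- rle(some_numbers)
enc$values        # 3 2 3 4 2
enc$lengths       # 1 3 2 1 1
inverse.rle(enc)  # recovers the original 8 values
```

The saving grows with the data: long repetitive vectors (like per-base coverage) collapse to a handful of runs, which is why Bioconductor uses Rle so heavily.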

Gene of Interest:

  • Genomic sequences can be split over numerous chromosomes; may want to extract with sequential ranges
    • library(GenomicRanges)
    • (myGR <- GRanges("chr1:200-300")) # name:start-end as a character (each range is associated with a chromosome)
    • methods(class = "GRanges") # to check available accessors
    • seqnames(gr) # used for chromosome names
    • ranges(gr) # returns an IRanges object for ranges
    • mcols(gr) # stores metadata columns
    • seqinfo(gr) # generic function to store sequence information
    • genome(gr) # stores the genome name
  • Accessors can be inherited across S4 classes
  • One of the genes of interest is ABCD1 (end of the X chromosome long arm)
    • library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    • hg <- TxDb.Hsapiens.UCSC.hg38.knownGene
    • hg_chrXg <- genes(hg, filter = list(tx_chrom = c("chrX")))

Manipulating collections of GRanges:

  • The GRangesList-class is a container for storing a collection of GRanges
    • as(mylist, "GRangesList")
    • GRangesList(myGRanges1, myGRanges2, ...)
    • unlist(myGRangesList) # convert back to GRanges
    • methods(class = "GRangesList")
  • Multiple GRanges objects may be combined into a GRangesList
    • GRanges in a list will be taken as compound features of a larger object
    • transcripts by gene, exons by transcripts, read alignments, sliding windows
  • Can break a region into smaller regions
    • hg_chrX
    • slidingWindows(hg_chrX, width = 20000, step = 10000) # there is overlap of 10000 and the last range will (typically) be shorter
  • Can grab known genomic features
    • library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    • (hg <- TxDb.Hsapiens.UCSC.hg38.knownGene)
  • Can then extract genomic features
    • seqlevels(hg) <- c("chrX")
    • transcripts(hg, columns = c("tx_id", "tx_name"), filter = NULL)
    • exons(hg, columns = c("tx_id", "exon_id"), filter = list(tx_id = "179161"))
    • exonsBytx <- exonsBy(hg, by = "tx") # exons by transcript
    • abcd1_179161 <- exonsBytx[["179161"]] # transcript id
  • Can also find genes of interest in the overlaps
    • countOverlaps(query, subject)
    • findOverlaps(query, subject)
    • subsetByOverlaps(query, subject)
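The start/end arithmetic behind slidingWindows() can be sketched in base R: windows of a given width advance by the step size, so consecutive windows overlap by width - step, and the final window is truncated at the range end. (This is an illustrative sketch, not GenomicRanges' actual implementation; the function name is made up.)

```r
# Hypothetical helper showing sliding-window coordinates over one range.
sliding_starts <- function(range_start, range_end, width, step) {
  starts <- seq(range_start, range_end, by = step)
  ends   <- pmin(starts + width - 1, range_end)  # clip the last window(s)
  data.frame(start = starts, end = ends, width = ends - starts + 1)
}

w <- sliding_starts(1, 45000, width = 20000, step = 10000)
w
# consecutive windows overlap by 20000 - 10000 = 10000 bases, and the
# last window is shorter than 20000
```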

Example code includes:

# load package IRanges
library(IRanges)

# start vector 1 to 5 and end 100 
IRnum1 <- IRanges(start=1:5, end=100)

# end 100 and width 89 and 10
IRnum2 <- IRanges(end=100, width=c(89, 10))

# logical argument start = Rle(c(F, T, T, T, F, T, T, T))
IRlog1 <- IRanges(start = Rle(c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)))

# Printing objects in a list
print(list(IRnum1 = IRnum1, IRnum2 = IRnum2, IRlog1 = IRlog1))


# Load Package Genomic Ranges
library(GenomicRanges)

# Print the GRanges object
myGR

# Check the metadata, if any
mcols(myGR)


# load human reference genome hg38
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

# assign hg38 to hg, then print it
hg <- TxDb.Hsapiens.UCSC.hg38.knownGene
hg

# extract all the genes in chromosome X as hg_chrXg, then print it
hg_chrXg <- genes(hg, filter = list(tx_chrom = c("chrX")))
hg_chrXg

# extract all positive stranded genes in chromosome X as hg_chrXgp, then sort it
hg_chrXgp <- genes(hg, filter = list(tx_chrom = c("chrX"), tx_strand = "+"))
sort(hg_chrXgp)


# load the human transcripts DB to hg
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
hg <- TxDb.Hsapiens.UCSC.hg38.knownGene

# prefilter chromosome X
seqlevels(hg) <- c("chrX")

# get all transcripts by gene
hg_chrXt <- transcriptsBy(hg, by="gene")

# select the 215th element from the transcripts (numeric position)
hg_chrXt[[215]]


# load the human transcripts DB to hg
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
hg <- TxDb.Hsapiens.UCSC.hg38.knownGene

# prefilter chromosome X
seqlevels(hg) <- c("chrX")

# get all transcripts by gene
hg_chrXt <- transcriptsBy(hg, by="gene")

# select gene `215` by name from the transcripts
hg_chrXt[['215']]


# Store the overlapping range in rangefound
rangefound <- subsetByOverlaps(hg_chrX, ABCD1)

# Check names of rangefound
names(rangefound)

# Check the geneOfInterest 
ABCD1

# Check rangefound
rangefound

Chapter 4 - Introducing ShortRead

Sequence Files:

  • Plant genomes tend to be large datasets; the example plant genome here has about 135 million base pairs
  • Can store with text formats fastq (quality encoding per sequence letter) and fasta
  • The fastq format stores four lines per read
    • @ plus a unique sequence identifier
    • raw sequence string
    • + plus an optional repeated identifier
    • quality encoding per sequence letter
  • The fasta (fasta, fa, seq) format is shorter and includes
    • unique sequence identifier
    • raw sequence string
  • Can read in the fasta files using
    • library(ShortRead)
    • fasample <- readFasta(dirPath = "data/", pattern = "fasta") # print fasample
    • methods(class = "ShortRead")
    • writeFasta(fasample, file = "data/sample.fasta")
  • Can read in the fastq files using
    • fqsample <- readFastq(dirPath = "data/", pattern = "fastq")
    • methods(class = "ShortReadQ")
    • writeFastq(fqsample, file = "data/sample.fastq.gz")
  • Sometimes valuable to set the seed and run a FastqSampler()
    • set.seed(123)
    • sampler <- FastqSampler("data/SRR1971253.fastq", 500)
    • sample_small <- yield(sampler)

Sequence Quality:

  • Quality scores are logarithmic, somewhat like earthquake magnitudes
    • 10 = 1 in 10 wrong
    • 20 = 1 in 100 wrong
    • 30 = 1 in 1,000 wrong (often considered to be the cutoff for “good”)
    • 50 = 1 in 100,000 wrong
    • etc.
    • encoding(ShortRead::quality(fqsample))
    • ShortRead::quality(fqsample)
    • sread(fqsample)[1]
    • quality(fqsample)[1] # quality encoding values
    • pq <- PhredQuality(quality(fqsample)) ## PhredQuality instance
    • qs <- as(pq, "IntegerList") # transform encoding into scores
  • Can run a quality assessment process on files that have been read in
    • qaSummary <- qa(fqsample, lane = 1) # optional lane
    • names(qaSummary)
    • browseURL(report(qaSummary)) # HTML report
    • alphabet(sread(fullSample))
    • abc <- alphabetByCycle(sread(fullSample))
    • nucByCycle <- t(abc[1:4,])
    • nucByCycle <- nucByCycle %>% as.tibble() %>% mutate(cycle = 1:50) # convert to tibble and add cycle numbers
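The Phred arithmetic above can be verified directly in base R: in the Sanger/Illumina 1.8+ encoding, a quality character's score is its ASCII code minus 33, and the error probability is 10^(-Q/10). The quality string below is a made-up example, not course data.

```r
# Decode a hypothetical quality line into Phred scores and error probabilities.
qual_string <- "IIII?"                 # 'I' encodes Q40, '?' encodes Q30
scores  <- utf8ToInt(qual_string) - 33 # ASCII code minus the Sanger offset
scores                                 # 40 40 40 40 30
p_error <- 10^(-scores / 10)           # Q40 -> 1e-4, Q30 -> 1e-3
p_error
```

This matches the table above: Q30 means roughly 1 wrong base in 1,000, which is the usual "good" cutoff.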

Match and Filter:

  • Can run the match and filter process while reading in a large file
  • Duplicate sequences can result from a biological process, PCR amplification, sequencing the same fragment more than once, etc.
    • Duplicates should generally be marked, and an acceptable threshold set
    • table(srduplicated(dfqsample))
    • cleanReads <- mydReads[srduplicated(mydReads) == FALSE]
    • table(srduplicated(cleanReads))
  • Can create your own filters using srFilter
    • minWidth <- 51
    • readWidthCutOff <- srFilter(function(x) {width(x) >= minWidth}, name = "MinWidth")
    • fqsample[readWidthCutOff(fqsample)]
    • myFilter <- nFilter(threshold = 10, .name = "cleanNFilter")
    • filtered <- readFastq(dirPath = "data", pattern = ".fastq", filter = myFilter)
  • Can also use idFilter and polynFilter
    • myFilterID <- idFilter(regex = ":3:1")
    • filtered <- readFastq(dirPath = "data", pattern = ".fastq", filter = myFilterID)
    • myFilterPolyA <- polynFilter(threshold = 10, nuc = c("A"))
    • filtered[myFilterPolyA(filtered)]
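The srduplicated() clean-up pattern above has a direct base-R analogue: duplicated() flags repeats, and keeping the FALSE entries leaves one copy of each read. The reads below are toy strings, not real data.

```r
# Base-R analogue of marking and dropping duplicate reads.
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")
table(duplicated(reads))            # 3 of the 6 reads are duplicates
cleanReads <- reads[!duplicated(reads)]
cleanReads                          # "ACGT" "TTGA" "GGCC"
table(duplicated(cleanReads))       # all FALSE after the clean-up
```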

Multiple Assessment:

  • Desire to save time and resources when reading in high volume data - parallel processing
    • library(Rqc) # uses many of the basic Bioconductor packages as well as some of the basic CRAN packages
    • qaRqc <- rqcQA(files) # pass a file list as the files object
    • class(qaRqc) # "list"
    • names(qaRqc) # names of the input files
    • qaRqc[1]
    • qaRqc <- rqcQA(files, workers = 4) # run in parallel, saving only the quality assessment
  • Can also save a sample of the parallel reads, using a seed for reproducibility
    • set.seed(1111)
    • qaRqc_sample <- rqcQA(files, workers = 4, sample = TRUE, n = 500)
  • Can also process paired-end files using the pair = argument
    • pfiles <- c("data/seq1_1.fq", "data/seq1_2.fq", "data/seq2_1.fq", "data/seq2_2.fq")
    • qaRqc_paired <- rqcQA(pfiles, workers = 4, pair = c(1, 1, 2, 2))
    • reportFile <- rqcReport(qaRqc, templateFile = "myReport.Rmd")
    • browseURL(reportFile)
    • qaRqc <- rqcQA(files, workers = 4)
    • perFileInformation(qaRqc)

Introduction to Bioconductor:

  • Installing packages from Bioconductor
  • Basic techniques for reading, manipulating, filtering, raw genomic data
  • BSgenome and TxDb built-in datasets
  • Check the quality of sequence files using ShortRead and Rqc
  • Explored a variety of organisms

Example code includes:

# load ShortRead
library(ShortRead)

# print fqsample
fqsample

# class of fqsample
class(fqsample)

# class sread fqsample
class(sread(fqsample))

# id fqsample
id(fqsample)


qaSummary <- qa(fqsample, type = "fastq", lane = 1)

# load ShortRead
library(ShortRead)

# Check quality
quality(fqsample)

# Check encoding
encoding(quality(fqsample))

# Check baseQuality
qaSummary[["baseQuality"]]


# glimpse nucByCycle
glimpse(nucByCycle)

# make an awesome plot!
nucByCycle %>% 
  # gather the nucleotide letters in alphabet and get a new count column
  gather(key = alphabet, value = count , -cycle) %>% 
  ggplot(aes(x = cycle, y =  count, colour = alphabet)) +
  geom_line(size = 0.5 ) +
  labs(y = "Frequency") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank())


myStartFilter <- srFilter(function(x) substr(sread(x), 1, 5) == "ATGCA")

# Load package ShortRead
library(ShortRead)

# Check class of fqsample
class(fqsample)

# filter reads into selectedReads using myStartFilter
selectedReads <- fqsample[myStartFilter(fqsample)]

# Check class of selectedReads
class(selectedReads)

# Check detail of selectedReads
detail(selectedReads)


# Load package Rqc
library(Rqc)

# Average per cycle quality plot
rqcCycleAverageQualityPlot(qa)

# Average per cycle quality plot with white background
rqcCycleAverageQualityPlot(qa) + theme_minimal()

# Read quality plot with white background
rqcReadQualityPlot(qa) + theme_minimal()

Non-Linear Modeling in R with GAM

Chapter 1 - Introduction to Generalized Additive Models

Introduction:

  • There are trade-offs between model power (e.g., ML) and parsimony (e.g., linear regression), with GAM being a middle-ground solution
  • GAM allows for flexibly modeling non-linear relationships
    • linear_mod <- lm(y ~ x, data = my_data)
    • library(mgcv)
    • gam_mod <- gam(y ~ s(x), data = my_data) # s() is the smoothing function
  • The flexible smooth is built up from simpler basis functions
    • The overall smooth is the sum of the simpler basis functions
    • coef(gam_mod)
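The "overall smooth is the sum of the basis functions" claim can be checked directly, using the mcycle data that appears later in these notes: predict(..., type = "lpmatrix") returns each basis function evaluated at the data, so multiplying by the coefficients reproduces the fitted curve.

```r
library(mgcv)
data(mcycle, package = "MASS")

# Fit a GAM with a single smooth of times
gam_mod <- gam(accel ~ s(times), data = mcycle)

# Each column of the lp matrix is a basis function evaluated at the data
X <- predict(gam_mod, type = "lpmatrix")

# The fitted values are the sum of basis functions times their coefficients
all.equal(as.numeric(X %*% coef(gam_mod)), as.numeric(fitted(gam_mod)))  # TRUE
```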

Basis functions and smoothing:

  • Because the GAM often has many basis coefficients, there is a meaningful risk of over-fitting
    • Fit = Likelihood - lambda * Wiggliness (lambda is the smoothing parameter, and is optimized while R fits the data to the GAM)
  • Can fit the smoothing parameter using arguments in the gam() function call
    • gam(y ~ s(x), data = dat, sp = 0.1) # global sp argument
    • gam(y ~ s(x, sp = 0.1), data = dat) # sp argument specific to a term
    • gam(y ~ s(x), data = dat, method = "REML") # select smoothing using Restricted Maximum Likelihood
  • The number of basis functions also drives both fit to the training data and risk of over-fitting
    • gam(y ~ s(x, k = 3), data = dat, method = "REML")
    • gam(y ~ s(x, k = 10), data = dat, method = "REML") # even if k is "too high", REML will help prevent overfitting
    • gam(y ~ s(x), data = dat, method = "REML") # defaults

Multivariate GAMs:

  • Can run the GAM using multiple independent variables - smooths, categoricals, etc.
  • Can start with a simple model
    • model <- gam(hw.mpg ~ s(weight), data = mpg, method = "REML")
  • Can then extend the simple model to include additional predictors
    • model2 <- gam(hw.mpg ~ s(weight) + s(length), data = mpg, method = "REML")
    • model2 <- gam(hw.mpg ~ s(weight) + length, data = mpg, method = "REML") # combines linear and non-linear terms, since length is not enclosed in an s()
    • model2b <- gam(hw.mpg ~ s(weight) + s(length, sp = 1000), data = mpg, method = "REML") # effectively linear, since sp = 1000 forces strong smoothing
  • Linear terms are especially valuable with categorical predictors
    • model3 <- gam(hw.mpg ~ s(weight) + fuel, data = mpg, method = "REML") # fuel needs to be a factor; mgcv will not handle characters
    • model4 <- gam(hw.mpg ~ s(weight, by = fuel), data = mpg, method = "REML") # separate smooths by category of fuel
    • model4b <- gam(hw.mpg ~ s(weight, by = fuel) + fuel, data = mpg, method = "REML") # adds a per-category intercept as well as a smooth by category of fuel

Example code includes:

data(mcycle, package="MASS")

# Examine the mcycle data frame
head(mcycle)
##   times accel
## 1   2.4   0.0
## 2   2.6  -1.3
## 3   3.2  -2.7
## 4   3.6   0.0
## 5   4.0  -2.7
## 6   6.2  -2.7
plot(mcycle)

# Fit a linear model
lm_mod <- lm(accel ~ times, data = mcycle)

# Visualize the model
termplot(lm_mod, partial.resid = TRUE, se = TRUE)

# Load mgcv
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:forecast':
## 
##     getResponse
## The following object is masked from 'package:lme4':
## 
##     lmList
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
# Fit the model
gam_mod <- gam(accel ~ s(times), data = mcycle)

# Plot the results
plot(gam_mod, residuals = TRUE, pch = 1)

# Extract the model coefficients
coef(gam_mod)
## (Intercept)  s(times).1  s(times).2  s(times).3  s(times).4  s(times).5 
##  -25.545865  -63.718008   43.475644 -110.350132  -22.181006   35.034423 
##  s(times).6  s(times).7  s(times).8  s(times).9 
##   93.176458   -9.283018 -111.661472   17.603782
# Fit a GAM with 3 basis functions
gam_mod_k3 <- gam(accel ~ s(times, k = 3), data = mcycle)

# Fit with 20 basis functions
gam_mod_k20 <- gam(accel ~ s(times, k = 20), data = mcycle)

# Visualize the GAMs
par(mfrow = c(1, 2))
plot(gam_mod_k3, residuals = TRUE, pch = 1)
plot(gam_mod_k20, residuals = TRUE, pch = 1)

par(mfrow = c(1, 1))


# Extract the smoothing parameter
gam_mod <- gam(accel ~ s(times), data = mcycle, method = "REML")
gam_mod$sp
##     s(times) 
## 0.0007758036
# Fix the smoothing parameter at 0.1
gam_mod_s1 <- gam(accel ~ s(times), data = mcycle, sp = 0.1)

# Fix the smoothing parameter at 0.0001
gam_mod_s2 <- gam(accel ~ s(times), data = mcycle, sp = 0.0001)

# Plot both models
par(mfrow = c(2, 1))
plot(gam_mod_s1, residuals = TRUE, pch = 1)
plot(gam_mod_s2, residuals = TRUE, pch = 1)

par(mfrow = c(1, 1))


# Fit the GAM
gam_mod_sk <- gam(accel ~ s(times, k=50), sp=0.0001, data=mcycle)

# Visualize the model
plot(gam_mod_sk, residuals = TRUE, pch = 1)

data(mpg, package="gamair")

# Examine the data
head(mpg)
##   symbol loss        make fuel aspir doors       style drive eng.loc   wb
## 1      3   NA alfa-romero  gas   std   two convertible   rwd   front 88.6
## 2      3   NA alfa-romero  gas   std   two convertible   rwd   front 88.6
## 3      1   NA alfa-romero  gas   std   two   hatchback   rwd   front 94.5
## 4      2  164        audi  gas   std  four       sedan   fwd   front 99.8
## 5      2  164        audi  gas   std  four       sedan   4wd   front 99.4
## 6      2   NA        audi  gas   std   two       sedan   fwd   front 99.8
##   length width height weight eng.type cylinders eng.cc fuel.sys bore
## 1  168.8  64.1   48.8   2548     dohc      four    130     mpfi 3.47
## 2  168.8  64.1   48.8   2548     dohc      four    130     mpfi 3.47
## 3  171.2  65.5   52.4   2823     ohcv       six    152     mpfi 2.68
## 4  176.6  66.2   54.3   2337      ohc      four    109     mpfi 3.19
## 5  176.6  66.4   54.3   2824      ohc      five    136     mpfi 3.19
## 6  177.3  66.3   53.1   2507      ohc      five    136     mpfi 3.19
##   stroke comp.ratio  hp  rpm city.mpg hw.mpg price
## 1   2.68        9.0 111 5000       21     27 13495
## 2   2.68        9.0 111 5000       21     27 16500
## 3   3.47        9.0 154 5000       19     26 16500
## 4   3.40       10.0 102 5500       24     30 13950
## 5   3.40        8.0 115 5500       18     22 17450
## 6   3.40        8.5 110 5500       19     25 15250
str(mpg)
## 'data.frame':    205 obs. of  26 variables:
##  $ symbol    : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ loss      : int  NA NA NA 164 164 NA 158 NA 158 NA ...
##  $ make      : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel      : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspir     : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ style     : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ eng.loc   : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wb        : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length    : num  169 169 171 177 177 ...
##  $ width     : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height    : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ weight    : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ eng.type  : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ eng.cc    : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel.sys  : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore      : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke    : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ comp.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ hp        : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ rpm       : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city.mpg  : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ hw.mpg    : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price     : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
# Fit the model
mod_city <- gam(city.mpg ~ s(weight) + s(length) + s(price), data = mpg, method = "REML")

# Plot the model
plot(mod_city, pages = 1)

# Fit the model
mod_city2 <- gam(city.mpg ~ s(weight) + s(length) + s(price) + fuel + drive + style, data = mpg, method = "REML")

# Plot the model
plot(mod_city2, all.terms = TRUE, pages = 1)

# Fit the model
mod_city3 <- gam(city.mpg ~ s(weight, by=drive) + s(length, by=drive) + s(price, by=drive) + drive, 
                 data = mpg, method = "REML"
                 )

# Plot the model
plot(mod_city3, pages = 1)


Chapter 2 - Interpreting and Visualizing GAMs

Interpreting GAM Outputs:

  • Can get summaries of model output by way of the summary() function
    • mod_hwy <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + s(comp.ratio) + s(width) + fuel + cylinders, data = mpg, method = "REML")
    • summary(mod_hwy)
    • The first component of the summary shows the model family, link, and formula
    • The second component shows the parametric coefficients and their significance
    • The third component shows the approximate significance of the smooths # an edf of 1 is equivalent to a straight line, an edf of 2 to a parabola, etc.
    • Generally, a significant smooth can be thought of as one where a straight, horizontal line cannot be drawn through the 95% confidence interval

Visualizing GAMs:

  • Visualizations are a powerful way to inspect and communicate results
    • ?plot.gam
  • The mgcv plots are “partial effects” plots - the components that add up to the overall model
    • plot(gam_model, select = c(2, 3)) # default is that all smoothed terms are selected, can override with select
    • plot(gam_model, pages = 1) # default is as many pages as needed
    • plot(gam_model, pages = 1, all.terms = TRUE) # The all.terms will show the non-smoothed terms also
    • plot(gam_model, residuals = TRUE) # show the partial residuals
    • plot(gam_model, rug = TRUE) # show the rug on the x-axis
    • plot(gam_model, rug = TRUE, residuals = TRUE, pch = 1, cex = 1) # pch for shape and cex for size
    • plot(gam_model, shade = TRUE) # 95% CI is shaded rather than shown in dotted lines
    • plot(gam_model, shade = TRUE, shade.col = "lightblue") # color of shading
    • plot(gam_model, seWithMean = TRUE) # include the uncertainty of the model intercept in the CI
    • plot(gam_model, seWithMean = TRUE, shift = coef(gam_model)[1]) # brings in the intercept, which is the first coefficient of the model

Model checking with gam.check():

  • There are many potential pitfalls to be checked
    • Inadequate basis number
  • There is an automated call to look at the model results
    • gam.check(mod)
    • Convergence - if the model has not converged, the results are likely wrong
    • The p-values test whether the basis dimension (k) is adequate - a significant value suggests k may be too low, though this is only an approximate test
    • Standard regression residuals plots

Checking concurvity:

  • There can be collinearity concerns with a linear model, which can result in poor models with large confidence intervals
  • The GAM can have a concurvity concern, where one variable is a smooth function of another (such as x and x^2)
    • concurvity(m1, full = TRUE)
    • Generally, a concurvity worst case of 0.8+ is an area for concern
    • concurvity(m1, full = FALSE) # get the pairwise concurvities

Example code includes:

# Fit the model
mod_city4 <- gam(city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width),
                 data = mpg, method = "REML")

# View the summary
summary(mod_city4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## city.mpg ~ s(weight) + s(length) + s(price) + s(rpm) + s(width)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.201      0.188     134   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##             edf Ref.df      F  p-value    
## s(weight) 5.620  6.799 17.524  < 2e-16 ***
## s(length) 2.943  3.759  0.904    0.420    
## s(price)  1.000  1.000 16.647 6.68e-05 ***
## s(rpm)    7.751  8.499 16.486  < 2e-16 ***
## s(width)  1.003  1.005  0.006    0.939    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.831   Deviance explained = 84.7%
## -REML = 496.47  Scale est. = 7.0365    n = 199
# Fit the model
mod <- gam(accel ~ s(times), data = mcycle, method = "REML")

# Make the plot with residuals
plot(mod, residuals=TRUE)

# Change shape of residuals
plot(mod, residuals=TRUE, pch=1, cex=1)

# Fit the model
mod <- gam(hw.mpg ~ s(weight) + s(rpm) + s(price) + comp.ratio, 
           data = mpg, method = "REML")

# Plot the price effect
plot(mod, select=c(3))

# Plot all effects
plot(mod, all.terms=TRUE, pages=1)

# Plot the weight effect with colored shading
plot(mod, select = 1, shade=TRUE, shade.col="hotpink")

# Add the intercept value and uncertainty
plot(mod, select = 1, shade=TRUE, shade.col="hotpink", seWithMean=TRUE, shift=coef(mod)[1])

dat <- data.frame(y=c(11.17, 2.81, 12.9, 5.68, 5.58, -1.09, 5.42, 12.13, 4.73, 6.29, 5.74, 8.32, 9.76, 4.78, 9.08, 10.5, 9.4, 9.51, 14.58, 13.84, 4.01, 3.31, 5.32, 6.6, 10.54, 13.19, 10.06, 8.6, -0.62, 4.78, 5.98, 2.75, 1.36, 8.51, 8.12, 4.18, 10.65, 5.92, -0.03, 6.48, 9.12, 6.57, 15.38, 11.76, 7.47, 12, 3.4, 3.39, 0.95, 5.49, 7.92, 8.04, 8.81, 6.65, 8.93, 0.55, 6.73, 3.38, 4.42, 8.23, 12.2, 14.45, 2.82, 5.58, 8.74, 14.14, 5.74, 4.59, 14.54, 6.65, 4.21, 8.71, 1.76, 6.22, 8.87, 10.3, 9.18, 5.05, 5.44, 4.86, 3.25, 4.59, 12.01, 6.69, 6.3, 6.85, 5.45, 15.43, -0.9, 3.43, 9.83, 1.04, 1.16, 16.7, 9.16, 8.46, 7.81, 4.97, 7.46, 1.49, 8.01, 9.48, 9.43, 3.92, 6.2, 7.63, 8.56, 11.53, 9.98, 2.49, 5.67, 3.48, 7.92, 8.62, 7.44, 6.35, 10.88, 9.74, 3.79, 15.43, 6.56, 2.5, 6.66, 9.75, 12.72, 14.64, 8.9, 10.74, 5.93, 2.53, 3.69, 15.25, 0.5, 11.8, 13.19, 6.05, -1.26, 9.09, 9.78, 7.23, 11.67, 12.54, -0.36, 9.4, 7.87, 13.46, 9.33, 2.55, 9.23, 5.95, 10.46, 3.39, 3.81, 7.25, 3.94, 10.18, 8.63, 11.51, 2.42, 9.44, 5.95, 7.75, 10.16, 16.11, 5.16, 3.13, 7.75, 9.96, 7.27, 14.62, 3.88, 10.2, 5.86, 16.18, 5.4, 1.55, 2.91, 9.16, 9.77, 2.25, 5.01, 8.79, 3.34, 7.09, 8.18, 3.34, 8.02, 8.12, 6.69, 3.22, 8.15, 5.01, 11.51, 6.62, 7.07, 0.52, 10.26, 7.99, 8.98, 9.87), 
                  x0=c(0.9, 0.27, 0.37, 0.57, 0.91, 0.2, 0.9, 0.94, 0.66, 0.63, 0.06, 0.21, 0.18, 0.69, 0.38, 0.77, 0.5, 0.72, 0.99, 0.38, 0.78, 0.93, 0.21, 0.65, 0.13, 0.27, 0.39, 0.01, 0.38, 0.87, 0.34, 0.48, 0.6, 0.49, 0.19, 0.83, 0.67, 0.79, 0.11, 0.72, 0.41, 0.82, 0.65, 0.78, 0.55, 0.53, 0.79, 0.02, 0.48, 0.73, 0.69, 0.48, 0.86, 0.44, 0.24, 0.07, 0.1, 0.32, 0.52, 0.66, 0.41, 0.91, 0.29, 0.46, 0.33, 0.65, 0.26, 0.48, 0.77, 0.08, 0.88, 0.34, 0.84, 0.35, 0.33, 0.48, 0.89, 0.86, 0.39, 0.78, 0.96, 0.43, 0.71, 0.4, 0.33, 0.76, 0.2, 0.71, 0.12, 0.25, 0.14, 0.24, 0.06, 0.64, 0.88, 0.78, 0.8, 0.46, 0.41, 0.81, 0.6, 0.65, 0.35, 0.27, 0.99, 0.63, 0.21, 0.13, 0.48, 0.92, 0.6, 0.98, 0.73, 0.36, 0.43, 0.15, 0.01, 0.72, 0.1, 0.45, 0.64, 0.99, 0.5, 0.48, 0.17, 0.75, 0.45, 0.51, 0.21, 0.23, 0.6, 0.57, 0.08, 0.04, 0.64, 0.93, 0.6, 0.56, 0.53, 0.99, 0.51, 0.68, 0.6, 0.24, 0.26, 0.73, 0.45, 0.18, 0.75, 0.1, 0.86, 0.61, 0.56, 0.33, 0.45, 0.5, 0.18, 0.53, 0.08, 0.28, 0.21, 0.28, 0.9, 0.45, 0.78, 0.88, 0.41, 0.06, 0.34, 0.72, 0.34, 0.63, 0.84, 0.86, 0.39, 0.38, 0.9, 0.64, 0.74, 0.61, 0.9, 0.29, 0.19, 0.89, 0.5, 0.88, 0.19, 0.76, 0.72, 0.94, 0.55, 0.71, 0.39, 0.1, 0.93, 0.28, 0.59, 0.11, 0.84, 0.32),
                  x1=c(0.78, 0.27, 0.22, 0.52, 0.27, 0.18, 0.52, 0.56, 0.13, 0.26, 0.72, 0.96, 0.1, 0.76, 0.95, 0.82, 0.31, 0.65, 0.95, 0.95, 0.34, 0.26, 0.17, 0.32, 0.51, 0.92, 0.51, 0.31, 0.05, 0.42, 0.85, 0.35, 0.13, 0.37, 0.63, 0.39, 0.69, 0.69, 0.55, 0.43, 0.45, 0.31, 0.58, 0.91, 0.14, 0.42, 0.21, 0.43, 0.13, 0.46, 0.94, 0.76, 0.93, 0.47, 0.6, 0.48, 0.11, 0.25, 0.5, 0.37, 0.93, 0.52, 0.32, 0.28, 0.79, 0.7, 0.17, 0.06, 0.75, 0.62, 0.17, 0.06, 0.11, 0.38, 0.17, 0.3, 0.19, 0.26, 0.18, 0.48, 0.77, 0.03, 0.53, 0.88, 0.37, 0.05, 0.14, 0.32, 0.15, 0.13, 0.22, 0.23, 0.13, 0.98, 0.33, 0.51, 0.68, 0.1, 0.12, 0.05, 0.93, 0.67, 0.09, 0.49, 0.46, 0.38, 0.99, 0.18, 0.81, 0.07, 0.4, 0.14, 0.19, 0.84, 0.72, 0.27, 0.5, 0.08, 0.35, 0.97, 0.62, 0.66, 0.31, 0.41, 1, 0.86, 0.95, 0.81, 0.78, 0.27, 0.76, 0.99, 0.29, 0.4, 0.81, 0.08, 0.36, 0.44, 0.16, 0.58, 0.97, 0.99, 0.18, 0.54, 0.38, 0.68, 0.27, 0.47, 0.17, 0.37, 0.73, 0.49, 0.06, 0.78, 0.42, 0.98, 0.28, 0.85, 0.08, 0.89, 0.47, 0.11, 0.33, 0.84, 0.28, 0.59, 0.84, 0.07, 0.7, 0.7, 0.46, 0.44, 0.56, 0.93, 0.23, 0.22, 0.42, 0.33, 0.86, 0.18, 0.49, 0.43, 0.56, 0.66, 0.98, 0.23, 0.24, 0.8, 0.83, 0.11, 0.96, 0.15, 0.14, 0.93, 0.51, 0.15, 0.35, 0.66, 0.31, 0.35), 
                  x2=c(0.15, 0.66, 0.19, 0.95, 0.9, 0.94, 0.72, 0.37, 0.78, 0.01, 0.94, 0.99, 0.36, 0.75, 0.79, 0.71, 0.48, 0.49, 0.31, 0.7, 0.82, 0.43, 0.51, 0.66, 0.14, 0.34, 0.41, 0.09, 0.93, 0.84, 0.88, 0.94, 0.07, 0.38, 0.54, 0.11, 0.8, 0.74, 0.05, 0.48, 0.92, 0.04, 0.29, 0.5, 0.61, 0.26, 0.42, 0.37, 0.94, 0.12, 0.07, 0.96, 0.44, 0.37, 0.14, 0.05, 0.66, 0.58, 0.99, 0.6, 0.06, 0.16, 0.48, 0, 0.44, 0.26, 0.94, 0.72, 0.16, 0.48, 0.69, 0.46, 0.96, 0.71, 0.4, 0.12, 0.24, 0.86, 0.44, 0.5, 0.69, 0.76, 0.16, 0.85, 0.95, 0.59, 0.5, 0.19, 0, 0.88, 0.13, 0.02, 0.94, 0.29, 0.16, 0.4, 0.46, 0.43, 0.52, 0.85, 0.06, 0.55, 0.69, 0.66, 0.66, 0.47, 0.97, 0.4, 0.85, 0.76, 0.53, 0.87, 0.47, 0.01, 0.73, 0.72, 0.19, 0.65, 0.54, 0.34, 0.64, 0.83, 0.71, 0.35, 0.13, 0.39, 0.93, 0.8, 0.76, 0.96, 0.99, 0.61, 0.03, 0.34, 0.28, 0.12, 0.04, 0.37, 0.34, 0.17, 0.62, 0.4, 0.96, 0.65, 0.33, 0.2, 0.12, 1, 0.38, 0.56, 0.73, 0.87, 0.57, 0.01, 0.91, 0.77, 0.38, 0.09, 0.05, 0.82, 0.83, 0.65, 0.13, 0.34, 0.73, 0.91, 0.7, 0.24, 0.64, 0.28, 0.96, 0.16, 0.42, 0.25, 0.09, 0.83, 0.53, 0.67, 0.41, 0.84, 0.74, 0.35, 0.95, 0.65, 0.04, 0.6, 0.42, 0.08, 0.53, 0.96, 0.71, 0.55, 0.24, 0.78, 0.65, 0.83, 0.65, 0.48, 0.5, 0.38), 
                  x3=c(0.45, 0.81, 0.93, 0.15, 0.75, 0.98, 0.97, 0.35, 0.39, 0.95, 0.11, 0.93, 0.35, 0.53, 0.54, 0.71, 0.41, 0.15, 0.34, 0.63, 0.06, 0.85, 0.21, 0.77, 0.14, 0.32, 0.62, 0.26, 0.63, 0.49, 0.94, 0.86, 0.37, 0.31, 0.83, 0.45, 0.32, 0.1, 0.06, 0.69, 0.67, 0.9, 0.3, 0.93, 0.2, 0.79, 0.22, 0.03, 0.86, 0.69, 0.94, 0.68, 0.84, 0.36, 0.39, 0.57, 0.1, 0.19, 0.59, 0.75, 0.87, 0.37, 0.8, 0.06, 0.62, 0.36, 0.59, 0.91, 0.2, 0.37, 0.67, 0.77, 0.52, 0.83, 0.53, 0.5, 0.42, 0.36, 0.12, 0.3, 0.28, 0.79, 0.78, 0.14, 0.52, 0.6, 0.51, 0.39, 0.43, 0.01, 0.92, 0.08, 0.51, 0.82, 0.6, 0.42, 0.56, 0.79, 0.17, 0.97, 0.47, 0.93, 0.9, 0.75, 0.68, 0.65, 0.07, 0.42, 0.53, 0.94, 0.71, 0.72, 0.47, 0.12, 0.78, 0.44, 0.43, 0.03, 0.15, 0.42, 0.77, 0, 0.6, 0.91, 0.71, 0.26, 0.85, 0.33, 0.58, 0.43, 0.05, 0.73, 0.55, 0.75, 0.05, 0.71, 0.3, 0.28, 0.83, 0.09, 0.04, 0.35, 0.54, 0.61, 0.27, 0.21, 0.38, 0.47, 0.84, 0.12, 0.68, 0.5, 0.9, 0.55, 0.13, 0.44, 0.19, 0.43, 0.23, 0.96, 0.45, 0.78, 0.16, 0.87, 0.21, 0.18, 0.16, 0.57, 0.73, 0.88, 0.71, 0.48, 0.82, 0.02, 1, 0.63, 0.43, 0.03, 0.75, 0.21, 1, 0.91, 0.71, 0.73, 0.47, 0.86, 0.17, 0.62, 0.29, 0.46, 0.05, 0.18, 0.06, 0.94, 0.34, 0.52, 0.63, 0.24, 0.52, 0.81)
                  )
str(dat)
## 'data.frame':    200 obs. of  5 variables:
##  $ y : num  11.17 2.81 12.9 5.68 5.58 ...
##  $ x0: num  0.9 0.27 0.37 0.57 0.91 0.2 0.9 0.94 0.66 0.63 ...
##  $ x1: num  0.78 0.27 0.22 0.52 0.27 0.18 0.52 0.56 0.13 0.26 ...
##  $ x2: num  0.15 0.66 0.19 0.95 0.9 0.94 0.72 0.37 0.78 0.01 ...
##  $ x3: num  0.45 0.81 0.93 0.15 0.75 0.98 0.97 0.35 0.39 0.95 ...
# Fit the model
mod <- gam(y ~ s(x0, k = 5) + s(x1, k = 5) + s(x2, k = 5) + s(x3, k = 5),
           data = dat, method = "REML")

# Run the check function
gam.check(mod)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-0.0001426464,0.0001241444]
## (score 461.1064 & scale 5.242973).
## Hessian positive definite, eigenvalue range [0.0001426384,97.53228].
## Model rank =  17 / 17 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k'  edf k-index p-value    
## s(x0) 4.00 2.53    0.92    0.12    
## s(x1) 4.00 2.22    1.07    0.80    
## s(x2) 4.00 3.94    0.84  <2e-16 ***
## s(x3) 4.00 1.00    1.01    0.48    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fit the model
mod <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 3) + s(x3, k = 3),
           data = dat, method = "REML")

# Check the diagnostics
gam.check(mod)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-0.0002159481,0.0007368124]
## (score 493.6694 & scale 7.805066).
## Hessian positive definite, eigenvalue range [0.0002170151,97.50484].
## Model rank =  9 / 9 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k'  edf k-index p-value    
## s(x0) 2.00 1.85    0.97    0.30    
## s(x1) 2.00 1.71    1.06    0.75    
## s(x2) 2.00 1.97    0.57  <2e-16 ***
## s(x3) 2.00 1.00    1.09    0.85    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Refit to fix issues
mod2 <- gam(y ~ s(x0, k = 3) + s(x1, k = 3) + s(x2, k = 12) + s(x3, k = 3),
           data = dat, method = "REML")

# Check the new model
gam.check(mod2)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 9 iterations.
## Gradient range [-0.0001262011,0.0001907036]
## (score 452.0731 & scale 4.569005).
## Hessian positive definite, eigenvalue range [0.01536015,97.63581].
## Model rank =  18 / 18 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##          k'   edf k-index p-value
## s(x0)  2.00  1.93    0.91    0.10
## s(x1)  2.00  1.89    1.12    0.97
## s(x2) 11.00  8.07    0.97    0.29
## s(x3)  2.00  1.18    1.04    0.76
# Fit the model
mod <- gam(hw.mpg ~ s(length) + s(width) + s(height) + s(weight), 
           data = mpg, method = "REML")

# Check overall concurvity
concurvity(mod, full=TRUE)
##                  para s(length)  s(width) s(height) s(weight)
## worst    1.079374e-20 0.9303404 0.9322887 0.6723705 0.9603887
## observed 1.079374e-20 0.7534619 0.8757513 0.4869308 0.8793300
## estimate 1.079374e-20 0.8353324 0.7943374 0.4452676 0.8567519
# Check pairwise concurvity
concurvity(mod, full=FALSE)
## $worst
##                   para    s(length)     s(width)    s(height)    s(weight)
## para      1.000000e+00 4.799804e-26 5.458174e-21 4.926340e-23 3.221614e-25
## s(length) 4.759962e-26 1.000000e+00 8.336513e-01 6.058015e-01 8.797217e-01
## s(width)  5.458344e-21 8.336513e-01 1.000000e+00 4.099837e-01 8.953662e-01
## s(height) 4.927251e-23 6.058015e-01 4.099837e-01 1.000000e+00 3.665831e-01
## s(weight) 3.233688e-25 8.797217e-01 8.953662e-01 3.665831e-01 1.000000e+00
## 
## $observed
##                   para    s(length)     s(width)    s(height)    s(weight)
## para      1.000000e+00 1.128295e-29 4.467995e-32 9.887661e-34 6.730965e-31
## s(length) 4.759962e-26 1.000000e+00 7.511142e-01 2.827977e-01 8.232449e-01
## s(width)  5.458344e-21 5.077384e-01 1.000000e+00 1.186126e-01 7.813743e-01
## s(height) 4.927251e-23 2.284116e-01 3.313152e-01 1.000000e+00 2.900361e-01
## s(weight) 3.233688e-25 6.052819e-01 7.863555e-01 1.494913e-01 1.000000e+00
## 
## $estimate
##                   para    s(length)     s(width)    s(height)    s(weight)
## para      1.000000e+00 1.564968e-28 1.740649e-23 3.448567e-25 1.481483e-27
## s(length) 4.759962e-26 1.000000e+00 6.415191e-01 2.271285e-01 7.209033e-01
## s(width)  5.458344e-21 6.477497e-01 1.000000e+00 1.054762e-01 7.241891e-01
## s(height) 4.927251e-23 3.303484e-01 2.644827e-01 1.000000e+00 2.669300e-01
## s(weight) 3.233688e-25 7.235198e-01 6.913221e-01 1.390568e-01 1.000000e+00

Chapter 3 - Spatial GAMs and Interactions

Two-dimensional smooths and spatial data:

  • Can expand models to include smooths of multiple variables, including their interactions
    • y = s(x1, x2)
    • gam(y ~ s(x1, x2), data = dat, method = "REML")
    • gam(y ~ s(x1, x2) + s(x3) + x4, data = dat, method = "REML")
  • Will use the meuse dataset, which is about heavy metals in the soil near a river
    • ?sp::meuse

Plotting and interpreting GAM interactions:

  • Can see interactions using plot(mod_2d)
    • plot(mod_2d)
    • Creates a topographic map of predicted values
    • plot(mod_2d, scheme = 1) # will show the 3D perspective plot
    • plot(mod_2d, scheme = 2) # heat map
    • vis.gam(x, view = NULL, cond = list(), n.grid = 30, too.far = 0, col = NA, color = "heat", contour.col = NULL, se = -1, type = "link", plot.type = "persp", zlim = NULL, nCol = 50, ...) # customizes plots
  • Can run custom plots using the vis.gam() function
    • vis.gam(x = mod, view = c("x1", "x2"), plot.type = "persp") # equivalent to scheme = 1
    • vis.gam(x = mod, view = c("x1", "x2"), plot.type = "contour") # equivalent to scheme = 2
    • vis.gam(mod, view = c("x1", "x2"), plot.type = "contour", too.far = 0.1) # set a range for not making predictions due to distance from training data
    • vis.gam(x = mod, view = c("x1", "x2"), plot.type = "persp", se = 2) # show +/- 2 standard-error surfaces around the plot
    • vis.gam(g, view = c("x1", "x2"), plot.type = "persp", theta = 220) # horizontal rotation
    • vis.gam(g, view = c("x1", "x2"), plot.type = "persp", phi = 55) # vertical rotation
    • vis.gam(g, view = c("x1", "x2"), plot.type = "persp", r = 0.1) # zoom level (low r can lead to distortions or parallax)
  • Additional options are available for contour plots
    • vis.gam(g, view = c("x1", "x2"), plot.type = "contour", color = "gray")
    • vis.gam(g, view = c("x1", "x2"), plot.type = "contour", contour.col = "blue")
    • vis.gam(g, view = c("x1", "x2"), plot.type = "contour", nlevels = 20)

Visualizing categorical-continuous interactions:

  • The categorical-continuous interaction was previously modeled using s(x1, by = x2)
  • Can instead use a factor smooth with argument bs = "fs" (best for controlling for categories that may have an impact but are not the primary categories of interest)
    • model4c <- gam(hw.mpg ~ s(weight, fuel, bs = "fs") + fuel, data = mpg, method = "REML")
    • plot(model4c)
    • vis.gam(model4c, theta = 125, plot.type = "persp")
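A runnable sketch of the factor-smooth idea, using the gamair mpg data from the earlier examples. Note this version omits the separate fuel term, since bs = "fs" already absorbs per-level intercepts:

```r
library(mgcv)
data(mpg, package = "gamair")

# One smooth of weight per fuel level, with a shared smoothing parameter
model4c <- gam(hw.mpg ~ s(weight, fuel, bs = "fs"), data = mpg, method = "REML")

# All level-specific smooths on one panel, then as a 3D surface
plot(model4c)
vis.gam(model4c, theta = 125, plot.type = "persp")
```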

Interactions with different scales: Tensors:

  • Tensor smooths allow for interactions on different scales, such as space and time
  • Within the meuse data, horizontal and vertical distances may have very different wiggliness (incomparable impacts)
  • A tensor smooth has two smoothing parameters, one for each of its variables
    • gam(y ~ te(x1, x2), data = data, method = "REML")
    • gam(y ~ te(x1, x2, k = c(10, 20)), data = data, method = "REML")
  • Tensor smooths can also help to tease out interactions and independent effects
    • gam(y ~ te(x1) + te(x2) + ti(x1, x2), data = data, method = "REML")
    • gam(y ~ s(x1) + s(x2) + ti(x1, x2), data = data, method = "REML")
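As a sketch of the tensor ideas on the meuse data introduced below: elev and dist are on very different scales, so te() (a separate smoothing parameter per variable) and the s() + ti() decomposition are natural fits. The variable choice here is illustrative:

```r
library(mgcv)
data(meuse, package = "sp")

# Tensor smooth: separate wiggliness for each predictor
tensor_mod <- gam(cadmium ~ te(elev, dist), data = meuse, method = "REML")

# Decomposition into main effects plus an interaction-only ti() term
tensor_mod2 <- gam(cadmium ~ s(elev) + s(dist) + ti(elev, dist),
                   data = meuse, method = "REML")
summary(tensor_mod2)
```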

Example code includes:

# Inspect the data
data(meuse, package="sp")
head(meuse)
##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2
##   lime landuse dist.m
## 1    1      Ah     50
## 2    1      Ah     30
## 3    1      Ah    150
## 4    0      Ga    270
## 5    0      Ah    380
## 6    0      Ga    470
str(meuse)
## 'data.frame':    155 obs. of  14 variables:
##  $ x      : num  181072 181025 181165 181298 181307 ...
##  $ y      : num  333611 333558 333537 333484 333330 ...
##  $ cadmium: num  11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
##  $ copper : num  85 81 68 81 48 61 31 29 37 24 ...
##  $ lead   : num  299 277 199 116 117 137 132 150 133 80 ...
##  $ zinc   : num  1022 1141 640 257 269 ...
##  $ elev   : num  7.91 6.98 7.8 7.66 7.48 ...
##  $ dist   : num  0.00136 0.01222 0.10303 0.19009 0.27709 ...
##  $ om     : num  13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
##  $ ffreq  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ soil   : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
##  $ lime   : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
##  $ dist.m : num  50 30 150 270 380 470 240 120 240 420 ...
# Fit the 2-D model
mod2d <- gam(cadmium ~ s(x, y), data=meuse, method="REML")

# Inspect the model
summary(mod2d)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cadmium ~ s(x, y)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.2458     0.1774    18.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(x,y) 23.48  27.24 8.667  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.607   Deviance explained = 66.7%
## -REML = 372.07  Scale est. = 4.8757    n = 155
coef(mod2d)
## (Intercept)    s(x,y).1    s(x,y).2    s(x,y).3    s(x,y).4    s(x,y).5 
##   3.2458065   0.8686658 -10.2154908   6.4161781  -2.6784725   9.1807111 
##    s(x,y).6    s(x,y).7    s(x,y).8    s(x,y).9   s(x,y).10   s(x,y).11 
##   3.7004932 -10.4780937  -8.9821840  -0.6218677  -4.6789789  -5.4267313 
##   s(x,y).12   s(x,y).13   s(x,y).14   s(x,y).15   s(x,y).16   s(x,y).17 
##   7.4996452   8.1962843  -7.6311640   4.5829340  -0.9734724   0.7634059 
##   s(x,y).18   s(x,y).19   s(x,y).20   s(x,y).21   s(x,y).22   s(x,y).23 
##   8.8112827  -4.8639552  -6.8085148   3.8059356   6.3499868   6.4701169 
##   s(x,y).24   s(x,y).25   s(x,y).26   s(x,y).27   s(x,y).28   s(x,y).29 
##  -8.1556061   7.2050985   0.1567317 -53.4384704  -4.2860149   5.5212533
# Models of this form (s(x,y) + s(v1) + ...) are a great way to model spatial data because they incorporate spatial relationships as well as independent predictors

# Fit the model
mod2da <- gam(cadmium ~ s(x, y) +s(elev) + s(dist), 
              data = meuse, method = "REML")

# Inspect the model
summary(mod2da)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cadmium ~ s(x, y) + s(elev) + s(dist)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.2458     0.1238   26.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df      F  p-value    
## s(x,y)  20.398 24.599  2.324  0.00078 ***
## s(elev)  1.282  1.496 28.868 6.52e-08 ***
## s(dist)  6.609  7.698 13.677 5.25e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.809   Deviance explained = 84.4%
## -REML = 321.06  Scale est. = 2.3762    n = 155
# Contour plot
plot(mod2da, pages = 1)

# 3D surface plot
plot(mod2da, scheme=1, pages = 1)

# Colored heat map
plot(mod2da, scheme=2, pages=1)

# Make the perspective plot with error surfaces
vis.gam(mod2d, view = c("x", "y"), plot.type="persp", se=2)

# Rotate the same plot
vis.gam(mod2d, view = c("x", "y"), plot.type="persp", se=2, theta=135)

# Make plot with 5% extrapolation
vis.gam(mod2d, view = c("x", "y"), plot.type = "contour", too.far=0.05)

# Overlay data
points(meuse)

# Make plot with 10% extrapolation
vis.gam(mod2d, view = c("x", "y"), plot.type="contour", too.far=0.1)

# Overlay data
points(meuse)

# Make plot with 25% extrapolation
vis.gam(mod2d, view = c("x", "y"), 
        plot.type = "contour", too.far = 0.25)

# Overlay data
points(meuse)

# Fit a model with separate smooths for each land-use level
mod_sep <- gam(copper ~ s(dist, by = landuse), data = meuse, method = "REML")

# Examine the summary
summary(mod_sep)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## copper ~ s(dist, by = landuse)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.726      1.371   26.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                      edf Ref.df      F  p-value    
## s(dist):landuseAa  1.371  1.605  0.493  0.43082    
## s(dist):landuseAb  1.000  1.000  1.674  0.19792    
## s(dist):landuseAg  1.514  1.815  0.940  0.28255    
## s(dist):landuseAh  2.496  3.081  8.783 1.96e-05 ***
## s(dist):landuseAm  1.000  1.000  8.606  0.00395 ** 
## s(dist):landuseB   1.000  1.000  1.207  0.27401    
## s(dist):landuseBw  1.000  1.000  0.007  0.93520    
## s(dist):landuseDEN 1.000  1.000  0.230  0.63255    
## s(dist):landuseFh  1.000  1.000  0.698  0.40494    
## s(dist):landuseFw  2.754  3.377  5.289  0.00120 ** 
## s(dist):landuseGa  2.791  2.958  3.720  0.01092 *  
## s(dist):landuseSPO 1.000  1.000  1.101  0.29599    
## s(dist):landuseSTA 1.245  1.430  0.179  0.65089    
## s(dist):landuseTv  1.000  1.000  0.698  0.40495    
## s(dist):landuseW   4.333  5.289 37.857  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.642   Deviance explained = 69.9%
## -REML = 580.91  Scale est. = 195.07    n = 154
# Fit a model with a factor-smooth interaction
mod_fs <- gam(copper ~ s(dist, landuse, bs="fs"), data = meuse, method = "REML")

# Examine the summary
summary(mod_fs)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## copper ~ s(dist, landuse, bs = "fs")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    30.07       3.33   9.031 1.43e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## s(dist,landuse) 16.37     71 2.463  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.533   Deviance explained = 58.3%
## -REML = 659.94  Scale est. = 254.2     n = 154
# Plot both the models with plot()
plot(mod_sep, pages=1)

plot(mod_fs, pages=1)

# Plot both the models with vis.gam()
vis.gam(mod_sep, view = c("dist", "landuse"), plot.type = "persp")

vis.gam(mod_fs, view = c("dist", "landuse"), plot.type = "persp")

# Fit the model
tensor_mod <- gam(cadmium ~ te(x, y, elev), data=meuse, method="REML")

# Summarize and plot
summary(tensor_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cadmium ~ te(x, y, elev)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.2458     0.1329   24.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## te(x,y,elev) 38.29  45.86 11.87  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.78   Deviance explained = 83.4%
## -REML = 318.09  Scale est. = 2.7358    n = 155
plot(tensor_mod)

# Fit the model
tensor_mod2 <- gam(cadmium ~ ti(x, y) + te(elev) + ti(x, y, elev), data=meuse, method="REML")

# Summarize and plot
summary(tensor_mod2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## cadmium ~ ti(x, y) + te(elev) + ti(x, y, elev)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5102     0.4311   8.143 3.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df      F  p-value    
## ti(x,y)      10.80 12.132  6.026 7.40e-09 ***
## te(elev)      2.79  3.099 11.317 1.14e-06 ***
## ti(x,y,elev) 17.20 22.376  2.759  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.714   Deviance explained = 77.1%
## -REML = 349.54  Scale est. = 3.5476    n = 155
plot(tensor_mod2, pages = 1)

par(mfrow=c(1, 1))

Chapter 4 - Logistic GAM for Classification

Types of model outcome:

  • Can also use GAM for making classifications (binary regression)
    • GAM output on the log-odds scale is converted to a probability using the logistic function - plogis()
    • The logit function converts probabilities to log-odds - qlogis()
    • gam(y ~ x1 + s(x2), data = dat, family = binomial, method = "REML") # family = binomial for binary outcomes
  • The csale dataset has anonymized insurance data
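
The two conversion functions are easy to sanity-check in base R; a minimal illustration (not course code):

```r
# plogis() maps log-odds to probabilities; qlogis() is its inverse
plogis(0)            # log-odds of 0 is a probability of 0.5
qlogis(0.5)          # probability of 0.5 is a log-odds of 0
plogis(qlogis(0.8))  # round-trips back to 0.8
```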

Visualizing logistic GAMs:

  • Can plot the log-odds
    • plot(binom_mod) # on the log-odds scale
    • plot(binom_mod, pages = 1, trans = plogis) # convert to probability scale
    • plot(binom_mod, pages = 1, trans = plogis, shift = coef(binom_mod)[1]) # add the intercept
    • plot(binom_mod, pages = 1, trans = plogis, shift = coef(binom_mod)[1], seWithMean = TRUE) # adds intercept uncertainty to smooth uncertainty
    • plot(binom_mod, pages = 1, trans = plogis, shift = coef(binom_mod)[1], seWithMean = TRUE, rug = FALSE, shade = TRUE, shade.col = "lightgreen", col = "purple")

Making predictions:

  • Can use the fitted models for making predictions
    • predict(log_mod2) # vector of predictions for the model data; returned on the link scale by default (log-odds for logistic)
    • predict(log_mod2, type = "response") # return on the probability scale
    • predict(log_mod2, type = "link", se.fit = TRUE) # first element of the list is the predictions and second element is approx. SE
  • For standard errors, use the link (log-odds) scale, then convert to the probability scale
  • Can also make predictions based on new data
    • test_predictions <- predict(trained_model, type = "response", newdata = test_df)
    • predict(log_mod2, type = "terms") # will show a column for the impact of each of the smooths (sum across columns plus intercept would be prediction)
    • plogis( sum(predict(log_mod2, type = "terms")[1, ]) + coef(log_mod2)[1] )
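
The link-then-convert recipe can be sketched with a toy logistic glm() on simulated data (the data and names here are made up, not the course's csale model):

```r
# Toy logistic model: build the interval on the link (log-odds) scale,
# then transform with plogis() so it stays inside [0, 1]
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + toy$x))
toy_mod <- glm(y ~ x, data = toy, family = binomial)

pred <- predict(toy_mod, newdata = data.frame(x = c(-1, 0, 1)),
                type = "link", se.fit = TRUE)

# +/- 2 SE on the log-odds scale, then convert; the resulting
# probability interval is asymmetric but always within [0, 1]
plogis(pred$fit + 2 * pred$se.fit)  # upper bounds
plogis(pred$fit - 2 * pred$se.fit)  # lower bounds
```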

Wrap up and next steps:

  • Basic theory of smooths for GAMs
  • Interpreting GAMs and plotting partial effects
  • Building and visualizing GAMs with interactions
  • Logistic GAMs for binary classification and predictions
  • Can extend to using the Tidyverse tools - broom, caret, etc.
  • Can further extend the smooths - see the help files in mgcv

Example code includes:

csale <- readRDS("./RInputFiles/csale.rds")

# Examine the csale data frame
head(csale)
## # A tibble: 6 x 8
##   purchase n_acts bal_crdt_ratio avg_prem_balance retail_crdt_rat~
##      <dbl>  <dbl>          <dbl>            <dbl>            <dbl>
## 1        0     11            0              2494.              0  
## 2        0      0           36.1            2494.             11.5
## 3        0      6           17.6            2494.              0  
## 4        0      8           12.5            2494.              0.8
## 5        0      8           59.1            2494.             20.8
## 6        0      1           90.1            2494.             11.5
## # ... with 3 more variables: avg_fin_balance <dbl>, mortgage_age <dbl>,
## #   cred_limit <dbl>
str(csale)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1779 obs. of  8 variables:
##  $ purchase         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ n_acts           : num  11 0 6 8 8 1 5 0 9 18 ...
##  $ bal_crdt_ratio   : num  0 36.1 17.6 12.5 59.1 ...
##  $ avg_prem_balance : num  2494 2494 2494 2494 2494 ...
##  $ retail_crdt_ratio: num  0 11.5 0 0.8 20.8 ...
##  $ avg_fin_balance  : num  1767 1767 0 1021 797 ...
##  $ mortgage_age     : num  182 139 139 139 93 ...
##  $ cred_limit       : num  12500 0 0 0 0 0 0 0 11500 16000 ...
# Fit a logistic model
log_mod <- gam(purchase ~ s(mortgage_age), data = csale, family=binomial, method = "REML")

# Fit a logistic model
log_mod2 <- gam(purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) + 
    s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) + 
    s(cred_limit), data = csale, family = binomial, method = "REML")

# View the summary
summary(log_mod2)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## purchase ~ s(n_acts) + s(bal_crdt_ratio) + s(avg_prem_balance) + 
##     s(retail_crdt_ratio) + s(avg_fin_balance) + s(mortgage_age) + 
##     s(cred_limit)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.64060    0.07557  -21.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df Chi.sq  p-value    
## s(n_acts)            3.474  4.310 93.670  < 2e-16 ***
## s(bal_crdt_ratio)    4.308  5.257 18.386  0.00318 ** 
## s(avg_prem_balance)  2.275  2.816  7.800  0.04958 *  
## s(retail_crdt_ratio) 1.001  1.001  1.422  0.23343    
## s(avg_fin_balance)   1.850  2.202  2.506  0.27895    
## s(mortgage_age)      4.669  5.710  9.656  0.13401    
## s(cred_limit)        1.001  1.002 23.066 1.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.184   Deviance explained = 18.4%
## -REML = 781.37  Scale est. = 1         n = 1779
# Plot on the log-odds scale
plot(log_mod2, pages=1)

# Plot on the probability scale
plot(log_mod2, pages = 1, trans = plogis)

# Plot with the intercept
plot(log_mod2, pages = 1, trans = plogis, shift = coef(log_mod2)[1])

# Plot with intercept uncertainty
plot(log_mod2, pages = 1, trans = plogis, shift = coef(log_mod2)[1], seWithMean = TRUE)

new_credit_data <- data.frame(matrix(
  data = c(1, 0, 0, 0, 0, 0, 0,
           2, 19, 0, 0, 1, 6, 3,
           0.3, 4.2, 36.095, 36.095, 25.7, 45.6, 10.8,
           61, 967, 2494.414, 2494.414, 2494.414, 195, 2494.414,
           11.491, 0, 11.491, 11.491, 11.491, 0, 11.491,
           1767.197, 249, 1767.197, 1767.197, 1767.197, 0, 1767.197,
           155, 65, 138.96, 138.96, 138.96, 13, 138.96,
           0, 10000, 0, 0, 0, 13800, 0),
  ncol = 8, nrow = 7, byrow = FALSE))
names(new_credit_data) <- c('purchase', 'n_acts', 'bal_crdt_ratio', 'avg_prem_balance', 'retail_crdt_ratio', 'avg_fin_balance', 'mortgage_age', 'cred_limit')
new_credit_data
##   purchase n_acts bal_crdt_ratio avg_prem_balance retail_crdt_ratio
## 1        1      2          0.300           61.000            11.491
## 2        0     19          4.200          967.000             0.000
## 3        0      0         36.095         2494.414            11.491
## 4        0      0         36.095         2494.414            11.491
## 5        0      1         25.700         2494.414            11.491
## 6        0      6         45.600          195.000             0.000
## 7        0      3         10.800         2494.414            11.491
##   avg_fin_balance mortgage_age cred_limit
## 1        1767.197       155.00          0
## 2         249.000        65.00      10000
## 3        1767.197       138.96          0
## 4        1767.197       138.96          0
## 5        1767.197       138.96          0
## 6           0.000        13.00      13800
## 7        1767.197       138.96          0
# Calculate predictions and errors
predictions <- predict(log_mod2, newdata = new_credit_data, 
                       type = "link", se.fit = TRUE)

# Calculate high and low predictions intervals
high_pred <- predictions$fit + 2*predictions$se.fit
low_pred <- predictions$fit - 2*predictions$se.fit

# Convert intervals to probability scale
# (1 / (1/exp(x) + 1) is algebraically the same as plogis(x))
high_prob <- 1 / (1/exp(high_pred) + 1)
low_prob <- 1 / (1/exp(low_pred) + 1)

# Inspect
high_prob
##          1          2          3          4          5          6 
## 0.29534339 0.80264652 0.07015439 0.07015439 0.12907886 0.37928666 
##          7 
## 0.27248599
low_prob
##          1          2          3          4          5          6 
## 0.15023805 0.53570264 0.03758807 0.03758807 0.06804119 0.11279970 
##          7 
## 0.15491807
# Predict from the model
prediction_1 <- predict(log_mod2, newdata = new_credit_data[1, ,drop=FALSE], type = "terms")

# Inspect
prediction_1
##    s(n_acts) s(bal_crdt_ratio) s(avg_prem_balance) s(retail_crdt_ratio)
## 1 -0.3626621         0.3352521            0.369506         -0.007531015
##   s(avg_fin_balance) s(mortgage_age) s(cred_limit)
## 1        -0.04057248      -0.1774484     0.2229033
## attr(,"constant")
## (Intercept) 
##   -1.640601

Machine Learning in the Tidyverse

Chapter 1 - Foundations of Tidy Machine Learning

Introduction:

  • Tidyverse tools center around the tibble, which can include "list" columns that store complex objects
  • There is also a list-column workflow of nest() - map() - unnest()
  • Course will use a more granular form of the gapminder dataset - 77 countries x 52 years x 6 features per observation
  • nest() turns each group of the data into a row of a new tibble
    • nested <- gapminder %>% group_by(country) %>% nest() # new column data is created; will contain the relevant data
    • nested$data[[4]] # will show the Austria data
    • nested %>% unnest(data) # will convert back to a normal tibble, using column data

Map family of functions:

  • The map family of functions helps fulfill the second and third steps of the nest() - map() - unnest() workflow
  • The map functions are all of the form map(.x = , .f = )
    • .x is the vector or list
    • .f is the function
    • output will be a list
    • .f = ~mean(.x) and .f=mean will do the same thing
  • Can use the map functions on a nested file
    • map(.x = nested$data, .f = ~mean(.x$population))
    • pop_df <- nested %>% mutate(pop_mean = map(data, ~mean(.x$population)))
    • pop_df %>% unnest(pop_mean) # gets back the numbers rather than a column of lists
  • Can also use a map function that specifies a requested vector return of a specific type or a list of models such as an lm
    • nested %>% mutate(pop_mean = map_dbl(data, ~mean(.x$population))) # return will be a vector of doubles rather than a list
    • nested %>% mutate(model = map(data, ~lm(formula = population~fertility, data = .x)))
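
A few standalone purrr calls (not tied to the gapminder data) illustrate the list vs. typed-vector return behavior:

```r
library(purrr)

# map() always returns a list; typed variants return atomic vectors
map(1:3, ~ .x^2)                       # list of 1, 4, 9
map_dbl(list(a = 1:3, b = 4:6), mean)  # named double vector: a = 2, b = 5

# .f = mean and .f = ~mean(.x) do the same thing
identical(map_dbl(list(1:3), mean), map_dbl(list(1:3), ~ mean(.x)))
```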

Tidy models with broom:

  • Several packages can help with the analysis of the list columns - broom, yardstick, rsample, etc.
  • The core of broom is extracting relevant data from models
    • tidy - statistical findings such as coefficients
    • glance - concise one-row summary
    • augment - adds prediction columns to the data being modeled
  • Example of using tidy to extract data from the lm() used previously in the chapter
    • tidy(algeria_model)
    • glance(algeria_model)
    • augment(algeria_model)
  • Plotting the augmented data can give a sense for model fits - predictions vs. actuals
    • augment(algeria_model) %>% ggplot(mapping = aes(x = year)) + geom_point(mapping = aes(y = life_expectancy)) + geom_line(mapping = aes(y = .fitted), color = "red")
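
The three broom verbs can be tried on any fitted model; a quick stand-in using mtcars rather than algeria_model:

```r
library(broom)

# A simple lm as a stand-in for the course's country models
fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)           # one row per term: estimate, std.error, statistic, p.value
glance(fit)         # one-row model summary: r.squared, AIC, ...
head(augment(fit))  # original data plus .fitted, .resid, ...
```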

Example code includes:

# Explore gapminder
data(gapminder, package="gapminder")
head(gapminder)

# Prepare the nested dataframe gap_nested
gap_nested <- gapminder %>% 
  group_by(country) %>% 
  nest()

# Explore gap_nested
head(gap_nested)


# Create the unnested dataframe called gap_unnested
gap_unnested <- gap_nested %>% 
  unnest()
  
# Confirm that your data was not modified  
identical(gapminder, gap_unnested)


# Extract the data of Algeria
algeria_df <- gap_nested$data[[which(gap_nested$country=="Algeria")]]

# Calculate the minimum of the population vector
min(algeria_df$pop)

# Calculate the maximum of the population vector
max(algeria_df$pop)

# Calculate the mean of the population vector
mean(algeria_df$pop)


# Calculate the mean population for each country
pop_nested <- gap_nested %>%
  mutate(mean_pop = map(.x=data, .f=~mean(.x$pop)))

# Take a look at pop_nested
head(pop_nested)

# Extract the mean_pop value by using unnest
pop_mean <- pop_nested %>% 
  unnest(mean_pop)

# Take a look at pop_mean
head(pop_mean)


# Calculate mean population and store result as a double
pop_mean <- gap_nested %>%
  mutate(mean_pop = map_dbl(.x=data, ~mean(.x$pop)))

# Take a look at pop_mean
head(pop_mean)


# Build a linear model for each country
gap_models <- gap_nested %>%
    mutate(model = map(.x=data, .f=~lm(formula = lifeExp ~ year, data = .x)))
    
# Extract the model for Algeria    
algeria_model <- gap_models$model[[which(gap_models$country=="Algeria")]]

# View the summary for the Algeria model
summary(algeria_model)


# Extract the coefficients of the algeria_model as a dataframe
broom::tidy(algeria_model)

# Extract the statistics of the algeria_model as a dataframe
broom::glance(algeria_model)


# Build the augmented dataframe
algeria_fitted <- broom::augment(algeria_model)

# Compare the predicted values with the actual values of life expectancy
algeria_fitted %>% 
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp)) + 
  geom_line(aes(y = .fitted), color = "red")

Chapter 2 - Multiple Models with broom

Exploring coefficients across models:

  • The gapminder models contain data about all of the 77 countries
    • gap_nested <- gapminder %>% group_by(country) %>% nest()
    • gap_models <- gap_nested %>% mutate(model = map(data, ~lm(life_expectancy ~ year, data = .x)))
  • Can extract regression coefficients to better understand trends in the gapminder data
    • gap_models %>% mutate(coef = map(model, ~tidy(.x))) %>% unnest(coef)

Evaluating fit of many models:

  • Can also look at the R-squared across many models
    • gap_models %>% mutate(fit = map(model, ~glance(.x))) %>% unnest(fit)

Visually inspect the fit of many models:

  • Can compare predicted and actual observations from the gapminder dataset
    • augmented_models <- gap_models %>% mutate(augmented = map(model,~augment(.x))) %>% unnest(augmented)
    • augmented_models %>% filter(country == "Italy") %>% ggplot(aes(x = year, y = life_expectancy)) + geom_point() + geom_line(aes(y = .fitted), color = "red")

Improve the fit of your models:

  • Can instead use multiple linear regressions and again explore the goodness of fits
    • gap_fullmodels <- gap_nested %>% mutate(model = map(data, ~lm(formula = life_expectancy ~ ., data = .x)))
    • tidy(gap_fullmodels$model[[1]])
    • augment(gap_fullmodels$model[[1]])
    • glance(gap_fullmodels$model[[1]])

Example code includes:

# Extract the coefficient statistics of each model into nested dataframes
model_coef_nested <- gap_models %>% 
    mutate(coef = map(.x=model, .f=~broom::tidy(.x)))
    
# Simplify the coef dataframes for each model    
model_coef <- model_coef_nested %>%
    unnest(coef)

# Plot a histogram of the coefficient estimates for year         
model_coef %>% 
  filter(term=="year") %>% 
  ggplot(aes(x = estimate)) +
  geom_histogram()


# Extract the fit statistics of each model into dataframes
model_perf_nested <- gap_models %>% 
    mutate(fit = map(.x=model, .f=~broom::glance(.x)))

# Simplify the fit dataframes for each model    
model_perf <- model_perf_nested %>% 
    unnest(fit)
    
# Look at the first six rows of model_perf
head(model_perf)


# Plot a histogram of rsquared for the 77 models    
model_perf %>% 
  ggplot(aes(x=r.squared)) + 
  geom_histogram() 
  
# Extract the 4 best fitting models
best_fit <- model_perf %>% 
  top_n(n = 4, wt = r.squared)

# Extract the 4 models with the worst fit
worst_fit <- model_perf %>% 
  top_n(n = 4, wt = -r.squared)


best_augmented <- best_fit %>% 
  # Build the augmented dataframe for each country model
  mutate(augmented = map(.x=model, .f=~broom::augment(.x))) %>% 
  # Expand the augmented dataframes
  unnest(augmented)

worst_augmented <- worst_fit %>% 
  # Build the augmented dataframe for each country model
  mutate(augmented = map(.x=model, .f=~broom::augment(.x))) %>% 
  # Expand the augmented dataframes
  unnest(augmented)


# Compare the predicted values with the actual values of life expectancy 
# for the top 4 best fitting models
best_augmented %>% 
  ggplot(aes(x=year)) +
  geom_point(aes(y=lifeExp)) + 
  geom_line(aes(y=.fitted), color = "red") +
  facet_wrap(~country, scales = "free_y")

# Compare the predicted values with the actual values of life expectancy 
# for the top 4 worst fitting models
worst_augmented %>% 
  ggplot(aes(x=year)) +
  geom_point(aes(y=lifeExp)) + 
  geom_line(aes(y=.fitted), color = "red") +
  facet_wrap(~country, scales = "free_y")


# Build a linear model for each country using all features
gap_fullmodel <- gap_nested %>% 
  mutate(model = map(data, ~lm(formula = lifeExp ~ year + pop + gdpPercap, data = .x)))

fullmodel_perf <- gap_fullmodel %>% 
  # Extract the fit statistics of each model into dataframes
  mutate(fit = map(model, ~broom::glance(.x))) %>% 
  # Simplify the fit dataframes for each model
  unnest(fit)
  
# View the performance for the four countries with the worst fitting 
# four simple models you looked at before
fullmodel_perf %>% 
  filter(country %in% worst_fit$country) %>% 
  select(country, adj.r.squared)

Chapter 3 - Build, Tune, and Evaluate Regression Models

Training, test, and validation splits:

  • A key question is how well the model would perform on new (unseen) data
  • The train-test split can be helpful for assessing out-of-sample performance - rsample has a function for this, with prop = being the proportion retained for training
    • library(rsample)
    • gap_split <- initial_split(gapminder, prop = 0.75)
    • training_data <- training(gap_split)
    • testing_data <- testing(gap_split)
  • The train data can be further split into train and validate, with validate being used in model building (cross-validation)
    • cv_split <- vfold_cv(training_data, v = 3) # v=3 is for 3-fold cross-validation
    • cv_data <- cv_split %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))

Measuring cross-validation performance:

  • Can compare predictions made on the test or validate data to the actual values in the same datasets
  • The MAE (mean absolute error) captures the average magnitude of the differences between the predictions and the actuals
    • cv_prep_lm <- cv_models_lm %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))
  • There is also a map2(.x=, .y=, .f=) function that allows for two input columns
    • cv_prep_lm <- cv_models_lm %>% mutate(validate_actual = map(validate, ~.x$life_expectancy), validate_predicted = map2(model, validate, ~predict(.x, .y)))
    • library(Metrics)
    • cv_eval_lm <- cv_prep_lm %>% mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))

Building and tuning a random-forest model:

  • MAE for the gapminder model above is roughly 1.5 years
  • Can instead try a random forest model - can handle non-linear relationships and interactions
    • rf_model <- ranger(formula = , data = , seed = ___)
    • prediction <- predict(rf_model, new_data)$predictions
    • rf_model <- ranger(formula, data, seed, mtry, num.trees) # tuning the hyper-parameters
  • Can tune the mtry parameter using the crossing() function for a tidyverse approach
    • cv_tune <- cv_data %>% crossing(mtry = 1:5) # expands the frame for each hyperparameter of interest
    • cv_model_tunerf <- cv_tune %>% mutate(model = map2(train, mtry, ~ranger(formula = life_expectancy~., data = .x, mtry = .y)))

Measuring the test performance:

  • Can compare multiple models, including hyper-parameters
  • The final model is then built using ALL the training data, with the final error estimate based on the held-out test data
    • best_model <- ranger(formula = life_expectancy~., data = training_data, mtry = 2, num.trees = 100, seed = 42)
    • test_actual <- testing_data$life_expectancy
    • test_predict <- predict(best_model, testing_data)$predictions
    • mae(test_actual, test_predict)

Example code includes:

set.seed(42)

# Prepare the initial split object
gap_split <- rsample::initial_split(gapminder, prop = 0.75)

# Extract the training dataframe
training_data <- rsample::training(gap_split)

# Extract the testing dataframe
testing_data <- rsample::testing(gap_split)

# Calculate the dimensions of both training_data and testing_data
dim(training_data)
dim(testing_data)


set.seed(42)

# Prepare the dataframe containing the cross validation partitions
cv_split <- rsample::vfold_cv(training_data, v = 5)

cv_data <- cv_split %>% 
  mutate(
    # Extract the train dataframe for each split
    train = map(splits, ~rsample::training(.x)), 
    # Extract the validate dataframe for each split
    validate = map(splits, ~rsample::testing(.x))
  )

# Use head() to preview cv_data
head(cv_data)


# Build a model using the train data for each fold of the cross validation
cv_models_lm <- cv_data %>% 
  mutate(model = map(train, ~lm(formula = lifeExp ~ ., data = .x)))


cv_prep_lm <- cv_models_lm %>% 
  mutate(
    # Extract the recorded life expectancy for the records in the validate dataframes
    validate_actual = map(.x=validate, .f=~.x$lifeExp),
    # Predict life expectancy for each validate set using its corresponding model
    validate_predicted = map2(.x=model, .y=validate, .f=~predict(.x, .y))
  )


library(Metrics)
# Calculate the mean absolute error for each validate fold       
cv_eval_lm <- cv_prep_lm %>% 
  mutate(validate_mae = map2_dbl(.x=validate_actual, .y=validate_predicted, 
                                 .f=~mae(actual = .x, predicted = .y)
                                 )
         )

# Print the validate_mae column
cv_eval_lm$validate_mae

# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)


library(ranger)

# Build a random forest model for each fold
cv_models_rf <- cv_data %>% 
  mutate(model = map(train, ~ranger(formula = lifeExp ~ ., data = .x,
                                    num.trees = 100, seed = 42)))

# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>% 
  mutate(validate_predicted = map2(.x=model, .y=validate, .f=~predict(.x, .y)$predictions))


# Calculate validate MAE for each fold
cv_eval_rf <- cv_prep_rf %>% 
  mutate(validate_actual=map(.x=validate, .f=~.x$lifeExp), 
         validate_mae = map2_dbl(.x=validate_actual, .y=validate_predicted, 
                                 .f=~mae(actual = .x, predicted = .y)
                                 )
         )

# Print the validate_mae column
cv_eval_rf$validate_mae

# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)


# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>% 
  tidyr::crossing(mtry = 2:5) 

# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>% 
  mutate(model = map2(.x=train, .y=mtry, ~ranger(formula = lifeExp ~ ., 
                                           data = .x, mtry = .y, 
                                           num.trees = 100, seed = 42)))


# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>% 
  mutate(validate_predicted = map2(.x=model, .y=validate, .f=~predict(.x, .y)$predictions))

# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>% 
  mutate(validate_actual=map(.x=validate, .f=~.x$lifeExp),
         validate_mae = map2_dbl(.x=validate_actual, .y=validate_predicted, 
                                 .f=~mae(actual = .x, predicted = .y)
                                 )
         )

# Calculate the mean validate_mae for each mtry used  
cv_eval_tunerf %>% 
  group_by(mtry) %>% 
  summarise(mean_mae = mean(validate_mae))


# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = lifeExp ~ ., data = training_data,
                     mtry = 4, num.trees = 100, seed = 42)

# Prepare the test_actual vector
test_actual <- testing_data$lifeExp

# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions

# Calculate the test MAE
mae(test_actual, test_predicted)

Chapter 4 - Build, Tune, and Evaluate Classification Models

Logistic Regression Models:

  • Binary classification models can also be run using the nest and map approach
  • The dataset of interest is "attrition", based on a fictional employer
    • glm(formula = , data = , family = "binomial")
    • head(cv_data)
    • cv_models_lr <- cv_data %>% mutate(model = map(train, ~glm(formula = Attrition~., data = .x, family = "binomial")))

Evaluating Classification Models:

  • Need actual and predicted classes, plus a relevant metric
    • validate_prob <- predict(model, validate, type = "response")
    • table(validate_actual, validate_predicted)
    • accuracy(validate_actual, validate_predicted)
    • precision(validate_actual, validate_predicted) # of observations with predicted==TRUE, how often is actual==TRUE?
    • recall(validate_actual, validate_predicted) # of observations with actual==TRUE, how often is predicted==TRUE?
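
The metric definitions are easy to verify by hand on tiny made-up vectors (not the attrition data):

```r
library(Metrics)

# Hand-checkable toy example
actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

accuracy(actual, predicted)   # 3 of 5 agree: 0.6
precision(actual, predicted)  # 2 TP out of 3 predicted TRUE: 2/3
recall(actual, predicted)     # 2 TP out of 3 actual TRUE: 2/3
```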

Random Forest for Classification:

  • Can tune and build the random-forest model as per previous
    • cv_tune <- cv_data %>% crossing(mtry = c(2, 4, 8, 16))
    • cv_models_rf <- cv_tune %>% mutate(model = map2(train, mtry, ~ranger(formula = Attrition~., data = .x, mtry = .y, num.trees = 100, seed = 42)))
    • validate_classes <- predict(rf_model, rf_validate)$predictions
    • validate_predicted <- validate_classes == "Yes"

Wrap Up:

  • List Column Workflow
  • Explore Models with Broom
  • Build, Tune, and Evaluate Regression Models
  • Build, Tune, and Evaluate Classification Models

Example code includes:

attrition <- readRDS("./RInputFiles/attrition.rds")
str(attrition)
head(attrition)


set.seed(42)

# Prepare the initial split object
data_split <- rsample::initial_split(data=attrition, prop=0.75)

# Extract the training dataframe
training_data <- rsample::training(data_split)

# Extract the testing dataframe
testing_data <- rsample::testing(data_split)

set.seed(42)
cv_split <- rsample::vfold_cv(training_data, v=5)

cv_data <- cv_split %>% 
  mutate(
    # Extract the train dataframe for each split
    train = map(.x=splits, .f=~rsample::training(.x)),
    # Extract the validate dataframe for each split
    validate = map(.x=splits, .f=~rsample::testing(.x))
  )


# Build a model using the train data for each fold of the cross validation
cv_models_lr <- cv_data %>% 
    mutate(model = map(train, ~glm(formula = Attrition ~ ., data = .x, family = "binomial")))


# Extract the first model and validate 
model <- cv_models_lr$model[[1]]
validate <- cv_models_lr$validate[[1]]

# Prepare binary vector of actual Attrition values in validate
validate_actual <- validate$Attrition == "Yes"

# Predict the probabilities for the observations in validate
validate_prob <- predict(model, validate, type = "response")

# Prepare binary vector of predicted Attrition values for validate
validate_predicted <- validate_prob > 0.5


library(Metrics)

# Compare the actual & predicted performance visually using a table
table(validate_actual, validate_predicted)

# Calculate the accuracy
accuracy(validate_actual, validate_predicted)

# Calculate the precision
precision(validate_actual, validate_predicted)

# Calculate the recall
recall(validate_actual, validate_predicted)


cv_prep_lr <- cv_models_lr %>% 
  mutate(
    # Prepare binary vector of actual Attrition values in validate
    validate_actual = map(.x=validate, ~.x$Attrition == "Yes"),
    # Prepare binary vector of predicted Attrition values for validate
    validate_predicted = map2(.x=model, .y=validate, .f=~predict(.x, .y, type = "response") > 0.5)
  )


# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_lr %>% 
  mutate(validate_recall = map2_dbl(.x=validate_actual, .y=validate_predicted, .f=~recall(actual = .x, predicted = .y)))

# Print the validate_recall column
cv_perf_recall$validate_recall

# Calculate the average of the validate_recall column
mean(cv_perf_recall$validate_recall)


library(ranger)

# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>%
  crossing(mtry = c(2, 4, 8, 16)) 

# Build a cross validation model for each fold & mtry combination
cv_models_rf <- cv_tune %>% 
  mutate(model = map2(train, mtry, ~ranger(formula = Attrition~., 
                                           data = .x, mtry = .y,
                                           num.trees = 100, seed = 42)))


cv_prep_rf <- cv_models_rf %>% 
  mutate(
    # Prepare binary vector of actual Attrition values in validate
    validate_actual = map(validate, ~.x$Attrition == "Yes"),
    # Prepare binary vector of predicted Attrition values for validate
    validate_predicted = map2(.x=model, .y=validate, ~predict(.x, .y, type = "response")$predictions=="Yes")
  )

# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_rf %>% 
  mutate(recall = map2_dbl(.x=validate_actual, .y=validate_predicted, ~recall(actual=.x, predicted=.y)))

# Calculate the mean recall for each mtry used  
cv_perf_recall %>% 
  group_by(mtry) %>% 
  summarise(mean_recall = mean(recall))


# Build the logistic regression model using all training data
best_model <- glm(formula = Attrition ~ ., 
                  data = training_data, family = "binomial")


# Prepare binary vector of actual Attrition values for testing_data
test_actual <- testing_data$Attrition == "Yes"

# Prepare binary vector of predicted Attrition values for testing_data
test_predicted <- predict(best_model, newdata=testing_data, type = "response") > 0.5


# Compare the actual & predicted performance visually using a table
table(test_actual, test_predicted)

# Calculate the test accuracy
accuracy(test_actual, test_predicted)

# Calculate the test precision
precision(test_actual, test_predicted)

# Calculate the test recall
recall(test_actual, test_predicted)
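
For reference, the three Metrics scores used in this chapter can also be computed directly from confusion-table counts. A minimal sketch — the TP/FP/TN/FN counts below are made up for illustration, not taken from the attrition data:

```r
# Hypothetical confusion-table counts (not from the attrition data)
TP <- 20   # predicted Yes, actually Yes
FP <- 10   # predicted Yes, actually No
TN <- 150  # predicted No,  actually No
FN <- 5    # predicted No,  actually Yes

accuracy_manual  <- (TP + TN) / (TP + TN + FP + FN)  # overall agreement
precision_manual <- TP / (TP + FP)   # of those predicted Yes, the fraction correct
recall_manual    <- TP / (TP + FN)   # of those actually Yes, the fraction caught

accuracy_manual   # 0.9189189
precision_manual  # 0.6666667
recall_manual     # 0.8
```

Recall is the score tuned on above because, in an attrition setting, missing an employee who actually leaves (a false negative) is usually the costly error.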

Predictive Analytics Using Networked Data in R

Chapter 1 - Introduction, Networks, and Labeled Networks

Introduction:

  • Labeled networks and network structure for predicting node attributes
    • Predicting Age, Gender, Fraud, Churn, etc., for unknown nodes
  • Course covers labeled social networks, homophily, network featurization, and prediction using supervised learning
  • Example of course collaborations - network
    • library(igraph)
    • DataScienceNetwork <- data.frame(from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D', 'E', 'F', 'F', 'G', 'G', 'H', 'H', 'I'), to = c('B', 'C', 'D', 'E', 'C', 'D', 'D', 'G', 'E', 'F', 'G', 'F', 'G', 'I', 'I', 'H', 'I', 'J', 'J'))
    • g <- graph_from_data_frame(DataScienceNetwork, directed = FALSE)
    • pos <- cbind(c(2, 1, 1.5, 2.5, 4, 4.5, 3, 3.5, 5, 6), c(10.5, 9.5, 8, 8.5, 9, 7.5, 6, 4.5, 5.5, 4))
    • plot.igraph(g, edge.label = NA, edge.color = 'black', layout = pos, vertex.label = V(g)$name, vertex.color = 'white', vertex.label.color = 'black', vertex.size = 25)
    • V(g)$technology <- c('R', 'R', '?', 'R', 'R', 'R', 'P', 'P', 'P', 'P')
    • V(g)$color <- V(g)$technology
    • V(g)$color <- gsub('R', 'blue3', V(g)$color)
    • V(g)$color <- gsub('P', 'green4', V(g)$color)
    • V(g)$color <- gsub('?', 'gray', V(g)$color, fixed = TRUE)
  • The course uses a network edgeList, which records the relationships among customers

Labeled Networks, Social Influence:

  • Example of using node attributes “customers” which defines customers as having churned or not
  • If churn is a social phenomenon, then it should be predictable from a customer's connections
  • Can use the relational neighbor classifier
    • The percentage of labeled neighbors with a certain trait is used to predict the trait of an unlabeled node
    • rNeighbors <- c(4,3,3,5,3,2,3,0,1,0)
    • pNeighbors <- c(0,0,1,1,0,2,2,3,3,2)
    • rRelationalNeighbor <- rNeighbors / (rNeighbors + pNeighbors)
    • rRelationalNeighbor
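
The bullet code above can be combined into a short runnable sketch; the 0.5 cutoff for assigning the R label is an assumption added here for illustration:

```r
# Relational neighbor classifier: score each node by the fraction of its
# neighbors that carry the trait of interest (here, the "R" technology)
rNeighbors <- c(4, 3, 3, 5, 3, 2, 3, 0, 1, 0)  # neighbors labeled R
pNeighbors <- c(0, 0, 1, 1, 0, 2, 2, 3, 3, 2)  # neighbors labeled P

# Estimated P(label = R) for each node under the relational neighbor model
rRelationalNeighbor <- rNeighbors / (rNeighbors + pNeighbors)
round(rRelationalNeighbor, 2)
# 1.00 1.00 0.75 0.83 1.00 0.50 0.60 0.00 0.25 0.00

# Assign "R" when more than half of a node's neighbors use R (assumed cutoff)
predictedR <- rRelationalNeighbor > 0.5
```

A node with score exactly 0.5 is left unassigned by this rule; how ties are broken is a modeling choice.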

Challenges:

  • Desire to evaluate models using a test set (out-of-sample performance)
    • Harder to do with networks, where connections are important (cannot just select 60% of the nodes)
    • Often managed by training on one network and testing on another
  • Observations may not be iid (in fact, there tend to be strong correlations between connected nodes)
  • Collective inferencing is another challenge
    • Labels for the unknown nodes are inferred jointly, using knowledge of how connected nodes tend to influence each other

Example code includes:

library(igraph)


# load("./RInputFiles/StudentEdgelist.RData")

# Create edgeList
elFrom <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19, 19, 20, 20, 20, 21, 21, 22, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 25, 25, 26, 26, 27, 28, 28, 28, 29, 29, 29, 29, 32, 32, 32, 32, 32, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 37, 37, 583, 38, 38, 39, 39, 40, 40, 40, 40, 343, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 43, 43, 44, 44, 44, 44, 45, 45, 45, 45, 45, 45, 47, 47, 47, 47, 47, 48, 48, 49, 49, 50, 50, 51, 51, 52, 52, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 56, 56, 56, 56, 56, 684, 57, 57, 58, 58, 58, 58, 191, 59, 59, 59, 60, 60, 60, 60, 61, 61, 61, 61, 61, 61, 61, 62, 63, 63, 63, 64, 64, 64, 64, 64, 64, 99, 65, 66, 66, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 71, 71, 71, 71, 72, 72, 72, 73, 73, 73, 73, 74, 75, 76, 76, 76, 76, 76, 76, 238, 77, 77, 77, 78, 78, 78, 78, 78, 78, 78, 179, 80, 81, 707, 82, 83, 83, 83, 84, 84, 84, 84, 85, 85, 609, 85, 86, 86, 689, 86, 86, 87, 87, 87, 87, 87, 286, 459, 633, 88, 88, 89, 89, 89, 89, 90, 90, 90, 90, 91, 91, 92, 583, 92, 93, 93, 93, 94, 94, 358, 532, 95, 96, 97, 97, 707, 97, 98, 98, 98, 98, 98, 98, 98, 99, 99, 99, 99, 99, 807, 976, 101, 101, 101, 102, 102, 102, 102, 102, 784, 103, 104, 104, 105, 105, 106, 106, 106, 106, 106, 107, 583, 294, 109, 109, 110, 110, 111, 111, 111, 112, 112, 278, 113, 114, 114, 114, 114, 114, 115, 115, 115, 406, 116, 116, 116, 116, 267, 388, 117, 117, 117, 117, 117, 118, 358, 118, 119, 119, 119, 119, 119, 343, 120, 120, 121, 121, 121, 138, 207, 122, 122, 122, 122, 122, 889, 123, 123, 123, 123, 123, 124, 125, 125, 125, 125, 125, 125, 126, 126, 793, 889, 127, 128, 128, 128, 128, 128, 129, 707, 129, 337, 130, 130, 131, 131, 131, 131, 132, 133, 133, 133, 133, 532, 135, 406, 135, 186, 136, 364, 139, 139, 736, 467, 141, 288, 317, 142, 142, 142, 976, 143, 143, 143, 144, 144, 144, 889, 988, 
145, 145, 146, 146, 147, 288, 148, 149, 149, 149, 661, 149, 149, 149, 172, 150, 150, 150, 245, 152, 153, 153, 154, 154, 155, 155, 155, 186, 467, 156, 157, 157, 332, 642, 190, 159, 159, 159, 159, 159, 160, 160, 161, 161, 290, 162, 162, 163, 163, 163, 531, 164, 164, 164, 165, 406, 165, 165, 165, 166, 166, 166, 166, 167, 167, 168, 168, 233, 413, 683, 931, 170, 839, 170, 171, 171, 171, 171, 171, 172, 172, 400, 506, 184, 174, 174, 174, 174, 174, 175, 176, 176, 177, 177, 177, 177, 177, 177, 177, 177, 177, 393, 178, 178, 179, 179, 179, 180, 981, 181, 181, 181, 182, 183, 183, 185, 185, 186, 270, 187, 410, 625, 189, 189, 189, 190, 190, 190, 191, 191, 191, 191, 302, 192, 193, 193, 523, 741, 195, 195, 195, 424, 196, 349, 197, 197, 246, 198, 198, 198, 198, 199, 531, 200, 201, 201, 201, 201, 201, 411, 204, 205, 205, 205, 206, 206, 206, 206, 207, 207, 207, 207, 354, 208, 209, 587, 717, 211, 211, 211, 630, 666, 211, 212, 212, 212, 545, 212, 212, 456, 889, 214, 216, 216, 216)
elFrom <- c(elFrom, 216, 216, 927, 883, 422, 946, 219, 406, 734, 874, 221, 221, 222, 222, 853, 222, 222, 222, 223, 223, 223, 224, 224, 224, 274, 406, 227, 227, 228, 228, 229, 230, 230, 230, 964, 231, 231, 231, 864, 597, 232, 233, 675, 235, 964, 237, 416, 482, 568, 236, 237, 238, 238, 238, 239, 671, 734, 975, 240, 684, 240, 650, 241, 424, 817, 243, 683, 243, 243, 307, 456, 245, 245, 246, 968, 259, 247, 247, 669, 247, 487, 248, 248, 249, 689, 250, 250, 250, 251, 251, 251, 251, 891, 375, 855, 252, 253, 253, 253, 690, 789, 255, 255, 284, 587, 730, 256, 545, 257, 257, 281, 336, 614, 258, 258, 259, 259, 975, 274, 260, 260, 260, 260, 698, 261, 817, 262, 262, 262, 262, 803, 816, 406, 264, 264, 853, 265, 488, 601, 266, 267, 269, 269, 269, 269, 269, 270, 290, 271, 271, 927, 594, 274, 274, 274, 274, 963, 276, 276, 276, 454, 278, 278, 279, 279, 279, 329, 384, 280, 281, 822, 281, 482, 282, 284, 284, 284, 284, 285, 285, 285, 286, 286, 286, 286, 287, 287, 288, 288, 288, 289, 290, 449, 291, 669, 291, 788, 659, 292, 293, 522, 296, 296, 803, 296, 297, 299, 300, 300, 300, 300, 643, 301, 302, 302, 302, 302, 303, 303, 303, 303, 531, 704, 304, 305, 789, 306, 306, 306, 306, 308, 336, 309, 309, 745, 312, 312, 313, 313, 313, 637, 649, 790, 802, 960, 343, 524, 316, 317, 317, 318, 335, 319, 319, 319, 689, 320, 964, 321, 321, 321, 321, 323, 323, 323, 324, 324, 868, 698, 327, 529, 329, 329, 329, 329, 960, 332, 746, 811, 822, 865, 335, 334, 334, 334, 424, 335, 335, 336, 336, 336, 633, 931, 401, 341, 407, 412, 842, 343, 343, 343, 532, 637, 344, 553, 597, 399, 758, 346, 347, 347, 348, 348, 348, 348, 349, 349, 349, 817, 351, 351, 807, 855, 353, 353, 976, 354, 354, 354, 356, 356, 371, 358, 358, 707, 859, 994, 690, 360, 360, 361, 361, 361, 361, 973, 375, 362, 784, 365, 365, 365, 732, 890, 972, 637, 368, 369, 369, 369, 370, 643, 370, 370, 371, 372, 404, 373, 997, 375, 375, 376, 568, 377, 377, 378, 402, 840, 380, 425, 465, 569, 382, 382, 803, 894, 384, 384, 608, 526, 388, 388, 390, 390, 390, 390, 390, 
390, 672, 392, 393, 660, 395, 395, 743, 397, 398, 399, 996, 460, 401, 401, 401, 402, 403, 422, 403, 413, 406, 406, 406, 406, 408, 408, 930, 409, 409, 409, 409, 410, 410, 410, 411, 411, 412, 413, 414, 415, 415, 855, 416, 416, 445, 417, 417, 417, 871, 420, 946, 420, 420, 422, 877, 423, 424, 424, 425, 426, 822, 430, 429, 429, 429, 732, 429, 429, 430, 433, 669, 433, 467, 433, 434, 435, 436, 436, 812, 437, 437, 853, 963, 438, 439, 446, 440, 608, 440, 914, 441, 692, 441, 862, 442, 442, 443, 446, 744, 446, 880, 449, 482, 556, 912, 957, 453, 453, 453, 994, 454, 454, 551, 561, 998, 457, 457, 788, 473, 458, 689, 459, 689, 460, 460, 679, 812, 884, 886, 717, 741, 466, 467, 467, 760, 834, 529, 738, 470, 470, 503, 471, 471, 472, 472, 473, 473, 999, 474, 474, 474, 643, 835, 476, 690, 859, 478, 479, 877, 480, 480, 480, 503, 774, 482, 483, 584, 733, 487, 487, 488, 488, 642, 488, 577, 491, 802, 492, 493, 494, 494, 498, 496, 496, 642, 726, 497, 914, 498, 499, 499, 499, 500, 500, 942, 581, 915, 501, 543, 701, 770, 503, 503, 748, 506, 506, 506, 506, 737, 982, 913, 510, 566, 511, 692, 767, 932, 737, 660, 742, 552, 518, 700, 854, 640, 583, 521, 521, 522, 522, 524, 683, 718, 525, 526, 527, 759, 529, 530, 530, 998, 531, 531, 531, 532, 532, 532, 532, 532, 577)
elFrom <- c(elFrom, 533, 533, 877, 534, 534, 534, 734, 535, 535, 561, 536, 538, 538, 538, 539, 614, 652, 541, 541, 541, 997, 543, 543, 543, 946, 545, 791, 549, 550, 963, 705, 809, 554, 554, 555, 912, 556, 556, 556, 556, 557, 958, 632, 665, 719, 560, 560, 561, 570, 628, 707, 777, 923, 976, 564, 564, 738, 564, 564, 565, 565, 566, 566, 567, 567, 567, 569, 570, 958, 679, 958, 908, 638, 736, 577, 653, 808, 607, 645, 774, 865, 636, 583, 583, 999, 585, 585, 587, 840, 609, 589, 589, 995, 592, 874, 625, 698, 594, 595, 596, 822, 732, 823, 607, 771, 724, 804, 600, 839, 601, 731, 800, 854, 645, 607, 607, 607, 607, 608, 609, 610, 610, 891, 943, 613, 615, 800, 839, 638, 619, 619, 883, 620, 620, 768, 622, 877, 623, 623, 623, 633, 892, 889, 787, 984, 748, 761, 929, 690, 930, 630, 630, 632, 633, 685, 634, 634, 635, 635, 635, 637, 968, 883, 638, 638, 816, 642, 669, 643, 643, 644, 646, 895, 740, 649, 650, 650, 961, 651, 651, 856, 730, 849, 680, 908, 960, 661, 662, 931, 955, 666, 666, 669, 669, 669, 670, 671, 671, 768, 672, 812, 818, 675, 677, 677, 677, 679, 679, 679, 889, 683, 684, 732, 685, 876, 770, 687, 788, 690, 692, 692, 692, 692, 692, 790, 916, 698, 718, 722, 743, 937, 702, 704, 707, 759, 917, 713, 713, 732, 715, 892, 717, 718, 874, 921, 942, 744, 722, 891, 968, 724, 724, 726, 726, 881, 732, 733, 734, 793, 892, 740, 970, 742, 743, 744, 744, 784, 746, 886, 807, 770, 996, 994, 856, 878, 813, 769, 771, 902, 956, 982, 780, 810, 874, 825, 873, 840, 924, 789, 988, 790, 811, 791, 875, 795, 795, 800, 802, 964, 805, 980, 808, 808, 808, 874, 874, 811, 922, 819, 868, 929, 881, 822, 822, 952, 865, 963, 981, 988, 872, 832, 839, 960, 851, 855, 855, 860, 918, 861, 875, 875, 984, 888, 890, 897, 901, 998, 905, 961, 907, 918, 988, 976, 982, 922, 923, 924, 947, 970, 974, 997, 999, 942, 952, 984, 992)
elTo <- c(250, 308, 413, 525, 803, 894, 332, 433, 474, 847, 963, 968, 147, 290, 337, 393, 474, 179, 193, 233, 737, 793, 838, 684, 718, 237, 404, 698, 724, 285, 641, 86, 285, 376, 689, 758, 889, 145, 410, 544, 583, 835, 96, 788, 924, 43, 91, 446, 181, 289, 378, 406, 547, 784, 189, 399, 482, 822, 262, 308, 817, 832, 260, 997, 81, 229, 839, 56, 840, 183, 186, 397, 676, 760, 344, 534, 980, 303, 343, 395, 925, 988, 483, 522, 132, 335, 506, 643, 304, 704, 871, 872, 466, 524, 567, 683, 997, 264, 279, 896, 105, 356, 460, 568, 726, 789, 865, 902, 951, 988, 138, 293, 38, 614, 633, 224, 550, 64, 224, 463, 521, 41, 347, 566, 746, 885, 99, 424, 442, 459, 571, 613, 689, 807, 84, 106, 257, 883, 191, 222, 265, 631, 681, 853, 207, 296, 546, 726, 866, 161, 665, 640, 816, 160, 669, 284, 313, 371, 973, 270, 407, 748, 230, 410, 445, 587, 644, 651, 936, 961, 964, 320, 804, 284, 476, 506, 755, 919, 57, 730, 754, 85, 259, 609, 975, 59, 278, 360, 413, 454, 589, 609, 889, 170, 184, 215, 365, 426, 707, 828, 548, 294, 479, 671, 473, 497, 642, 914, 942, 999, 65, 358, 491, 669, 326, 310, 531, 717, 852, 882, 960, 172, 331, 416, 552, 643, 453, 607, 732, 994, 245, 428, 943, 97, 255, 279, 570, 238, 412, 235, 271, 532, 722, 927, 964, 77, 286, 638, 736, 216, 351, 526, 745, 911, 927, 971, 79, 970, 839, 82, 814, 288, 556, 595, 456, 560, 563, 976, 98, 249, 85, 802, 149, 661, 86, 788, 953, 312, 630, 696, 740, 793, 88, 88, 88, 682, 864, 400, 424, 460, 947, 208, 354, 683, 744, 467, 946, 195, 92, 932, 402, 411, 828, 321, 670, 95, 95, 931, 766, 129, 458, 97, 968, 116, 161, 336, 422, 493, 802, 877, 281, 442, 651, 826, 907, 100, 100, 352, 364, 414, 190, 618, 839, 914, 924, 103, 887, 635, 961, 182, 772, 380, 488, 510, 662, 670, 520, 107, 109, 584, 597, 267, 388, 338, 862, 933, 691, 998, 113, 763, 475, 506, 790, 835, 898, 420, 494, 819, 116, 440, 535, 692, 886, 117, 117, 437, 661, 853, 876, 877, 219, 118, 798, 156, 672, 731, 766, 825, 120, 972, 981, 551, 842, 963, 122, 122, 349, 527, 529, 650, 770, 122, 150, 
354, 377, 811, 922, 617, 130, 190, 337, 523, 742, 880, 409, 539, 126, 127, 937, 152, 159, 448, 527, 591, 317, 129, 973, 130, 702, 831, 324, 472, 492, 659, 436, 168, 172, 515, 538, 134, 329, 135, 714, 136, 706, 137, 241, 912, 140, 141, 660, 142, 142, 470, 735, 773, 142, 282, 705, 930, 300, 625, 715, 144, 144, 319, 774, 610, 741, 375, 148, 909, 383, 585, 593, 149, 718, 874, 894, 150, 178, 545, 873, 152, 487, 768, 999, 868, 995, 269, 560, 860, 156, 156, 645, 425, 849, 158, 158, 159, 342, 407, 595, 620, 849, 514, 645, 553, 802, 162, 549, 981, 239, 540, 864, 164, 537, 795, 800, 391, 165, 630, 761, 891, 333, 427, 581, 982, 855, 907, 398, 515, 169, 169, 169, 169, 184, 170, 890, 176, 356, 613, 679, 884, 426, 796, 173, 173, 174, 211, 274, 403, 734, 812, 511, 302, 363, 204, 246, 251, 283, 465, 557, 664, 704, 891, 178, 884, 915, 242, 302, 637, 273, 180, 289, 587, 696, 569, 671, 831, 305, 789, 277, 187, 341, 188, 188, 222, 439, 590, 620, 867, 880, 278, 356, 968, 989, 192, 419, 464, 673, 194, 194, 372, 403, 863, 196, 810, 197, 601, 666, 198, 401, 716, 719, 771, 594, 200, 756, 381, 666, 688, 788, 867, 202, 979, 498, 525, 898, 253, 539, 818, 859, 236, 448, 790, 880, 208, 879, 906, 210, 210, 261, 384, 545, 211, 211, 817, 346, 443, 451, 212, 752, 947, 213, 213, 674, 300, 322, 543, 671)
elTo <- c(elTo, 881, 216, 217, 218, 218, 239, 220, 220, 220, 649, 957, 265, 269, 222, 918, 920, 989, 675, 692, 809, 394, 463, 854, 225, 225, 658, 960, 307, 975, 695, 791, 811, 905, 230, 449, 452, 556, 231, 232, 769, 886, 234, 700, 235, 236, 236, 236, 236, 929, 503, 769, 822, 905, 517, 239, 239, 239, 297, 240, 808, 241, 769, 242, 242, 374, 243, 744, 930, 244, 244, 251, 642, 922, 246, 247, 291, 444, 247, 690, 248, 663, 944, 632, 249, 515, 769, 986, 334, 335, 465, 801, 251, 252, 252, 870, 322, 402, 721, 254, 254, 627, 852, 256, 256, 256, 987, 257, 909, 996, 258, 258, 258, 956, 996, 392, 845, 259, 260, 567, 600, 769, 916, 261, 781, 261, 306, 319, 323, 776, 263, 263, 264, 420, 795, 265, 862, 266, 266, 958, 876, 602, 743, 862, 888, 911, 441, 271, 324, 608, 271, 272, 437, 564, 600, 824, 274, 699, 743, 974, 277, 593, 821, 724, 865, 897, 280, 280, 992, 774, 281, 991, 282, 705, 398, 476, 730, 919, 746, 758, 823, 342, 635, 800, 821, 370, 390, 420, 556, 861, 779, 324, 291, 648, 291, 688, 291, 292, 954, 953, 296, 533, 577, 296, 849, 357, 596, 668, 715, 732, 881, 301, 753, 369, 419, 518, 987, 500, 655, 672, 825, 304, 304, 919, 653, 305, 434, 519, 540, 612, 894, 309, 485, 691, 311, 316, 617, 557, 596, 621, 314, 314, 314, 314, 314, 315, 316, 856, 537, 714, 895, 319, 361, 391, 534, 320, 804, 320, 590, 670, 875, 985, 447, 733, 812, 363, 655, 325, 326, 836, 328, 492, 621, 673, 984, 330, 901, 333, 333, 333, 333, 334, 417, 588, 966, 335, 589, 954, 614, 682, 802, 337, 338, 340, 368, 342, 342, 342, 588, 716, 892, 344, 344, 980, 345, 345, 346, 346, 834, 558, 623, 489, 559, 632, 850, 359, 588, 770, 351, 849, 971, 352, 352, 603, 854, 353, 819, 843, 856, 789, 823, 357, 536, 561, 359, 359, 359, 360, 884, 942, 450, 558, 701, 918, 361, 362, 461, 364, 580, 598, 685, 365, 366, 366, 367, 713, 499, 571, 700, 390, 370, 753, 921, 994, 863, 373, 471, 373, 508, 942, 435, 377, 645, 917, 948, 379, 379, 680, 382, 382, 382, 579, 768, 383, 383, 738, 837, 385, 386, 501, 636, 435, 462, 499, 606, 794, 923, 
392, 845, 870, 394, 866, 987, 396, 909, 856, 758, 399, 400, 417, 722, 823, 721, 419, 403, 550, 405, 547, 886, 934, 999, 486, 870, 408, 470, 581, 600, 604, 502, 961, 964, 430, 441, 897, 803, 897, 628, 833, 415, 798, 908, 417, 588, 966, 984, 419, 861, 420, 983, 985, 479, 422, 777, 913, 954, 887, 796, 428, 429, 578, 652, 667, 429, 739, 808, 759, 431, 431, 452, 433, 920, 554, 923, 580, 673, 436, 445, 743, 437, 437, 480, 647, 440, 505, 440, 709, 440, 603, 441, 772, 441, 777, 878, 752, 668, 446, 955, 448, 577, 450, 450, 450, 452, 607, 678, 787, 453, 806, 980, 455, 455, 455, 468, 584, 457, 458, 530, 458, 538, 459, 902, 979, 461, 461, 462, 462, 463, 466, 939, 797, 813, 468, 468, 469, 469, 905, 949, 471, 516, 949, 701, 800, 521, 530, 473, 644, 759, 987, 475, 475, 495, 477, 477, 799, 676, 479, 507, 605, 634, 481, 481, 929, 610, 484, 484, 638, 682, 544, 571, 488, 918, 490, 510, 492, 950, 773, 634, 876, 495, 829, 955, 497, 497, 833, 497, 652, 606, 696, 767, 655, 825, 500, 501, 501, 938, 502, 502, 502, 558, 949, 504, 571, 724, 766, 790, 508, 508, 509, 615, 511, 654, 513, 513, 513, 516, 517, 517, 518, 598, 518, 518, 519, 520, 617, 622, 733, 952, 567, 524, 524, 528, 541, 820, 528, 699, 575, 903, 530, 697, 871, 882, 565, 575, 590, 682, 968, 533, 622, 753)
elTo <- c(elTo, 533, 903, 923, 980, 535, 926, 970, 536, 791, 600, 709, 934, 544, 540, 541, 703, 866, 944, 541, 701, 799, 824, 543, 850, 547, 786, 606, 551, 554, 554, 827, 939, 711, 555, 729, 855, 872, 966, 621, 558, 559, 559, 559, 582, 636, 998, 562, 562, 562, 562, 562, 563, 604, 651, 564, 786, 844, 782, 836, 654, 810, 612, 769, 916, 910, 777, 570, 573, 573, 575, 576, 576, 656, 578, 578, 579, 579, 579, 579, 582, 606, 968, 584, 851, 857, 636, 588, 589, 711, 967, 590, 677, 593, 594, 594, 825, 981, 621, 596, 598, 598, 599, 599, 600, 600, 934, 601, 875, 602, 603, 603, 604, 798, 849, 940, 993, 902, 967, 611, 967, 611, 611, 843, 686, 618, 618, 619, 764, 876, 619, 916, 926, 622, 778, 622, 703, 720, 819, 624, 624, 625, 626, 626, 627, 627, 627, 629, 629, 740, 789, 892, 779, 634, 765, 813, 728, 821, 873, 644, 637, 638, 895, 934, 640, 694, 643, 753, 861, 987, 708, 647, 648, 957, 722, 905, 650, 786, 844, 654, 656, 656, 657, 658, 658, 852, 786, 662, 664, 751, 817, 694, 711, 985, 786, 734, 831, 672, 825, 673, 674, 725, 775, 798, 945, 749, 884, 886, 679, 931, 754, 685, 814, 686, 687, 783, 688, 942, 695, 790, 809, 833, 910, 695, 697, 725, 699, 699, 699, 699, 928, 919, 777, 708, 709, 749, 935, 715, 935, 716, 852, 937, 719, 720, 721, 722, 905, 723, 723, 829, 865, 865, 966, 728, 881, 796, 755, 737, 740, 926, 740, 830, 974, 753, 803, 745, 791, 749, 754, 757, 758, 760, 762, 764, 765, 846, 873, 775, 775, 776, 977, 781, 781, 782, 785, 786, 788, 876, 789, 887, 791, 905, 792, 857, 858, 810, 869, 804, 990, 806, 875, 928, 955, 810, 811, 884, 811, 843, 819, 820, 821, 959, 991, 827, 829, 830, 831, 831, 832, 928, 965, 840, 857, 870, 872, 910, 860, 985, 941, 985, 887, 969, 977, 954, 904, 903, 949, 905, 933, 910, 910, 912, 921, 937, 1000, 940, 925, 926, 928, 933, 934, 940, 940, 966, 966)

edgeList <- data.frame(from=elFrom, 
                       to=elTo, 
                       stringsAsFactors = FALSE
                       )

# Inspect edgeList
str(edgeList)
## 'data.frame':    1663 obs. of  2 variables:
##  $ from: num  1 1 1 1 1 1 2 2 2 2 ...
##  $ to  : num  250 308 413 525 803 894 332 433 474 847 ...
head(edgeList)
##   from  to
## 1    1 250
## 2    1 308
## 3    1 413
## 4    1 525
## 5    1 803
## 6    1 894
# Construct the igraph object
network <- graph_from_data_frame(edgeList, directed = FALSE)

# View your igraph object
network
## IGRAPH d63def4 UN-- 956 1663 -- 
## + attr: name (v/c)
## + edges from d63def4 (vertex names):
##  [1] 1 --250 1 --308 1 --413 1 --525 1 --803 1 --894 2 --332 2 --433
##  [9] 2 --474 2 --847 2 --963 2 --968 3 --147 3 --290 3 --337 3 --393
## [17] 3 --474 4 --179 4 --193 4 --233 5 --737 5 --793 5 --838 6 --684
## [25] 6 --718 7 --237 7 --404 8 --698 8 --724 10--285 10--641 11--86 
## [33] 11--285 11--376 11--689 11--758 11--889 12--145 12--410 12--544
## [41] 13--583 13--835 14--96  14--788 14--924 15--43  15--91  15--446
## [49] 16--181 16--289 16--378 16--406 16--547 16--784 17--189 17--399
## [57] 17--482 17--822 18--262 18--308 18--817 18--832 19--260 19--997
## + ... omitted several edges
# load("./RInputFiles/StudentCustomers.RData")

custID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 
420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651)
custID <- c(custID, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956)
custChurn <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
custChurn <- c(custChurn, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

customers <- data.frame(id=custID, churn=custChurn)
# Inspect the customers dataframe
str(customers)
## 'data.frame':    956 obs. of  2 variables:
##  $ id   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ churn: num  0 0 0 0 0 0 0 0 0 0 ...
head(customers)
##   id churn
## 1  1     0
## 2  2     0
## 3  3     0
## 4  4     0
## 5  5     0
## 6  6     0
# Count the number of churners and non-churners
table(customers$churn)
## 
##   0   1 
## 926  30
matchID <- match(V(network), customers$id)
churnID <- customers$churn[matchID]
table(churnID)
## churnID
##   0   1 
## 926  30
churnID[is.na(churnID)] <- 0
table(churnID)
## churnID
##   0   1 
## 926  30
# Add a node attribute called churn
V(network)$churn <- churnID


# useVerts <- c('1', '10', '100', '1000', '101', '102', '103', '104', '105', '106', '107', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '13', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '14', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '15', '150', '152', '153', '154', '155', '156', '157', '158', '159', '16', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '17', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '18', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '19', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '2', '20', '200', '201', '202', '204', '205', '206', '207', '208', '209', '21', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '22', '220', '221', '222', '223', '224', '225', '227', '228', '229', '23', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '24', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '25', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '26', '260', '261', '262', '263', '264', '265', '266', '267', '269', '27', '270', '271', '272', '273', '274', '276', '277', '278', '279', '28', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '29', '290', '291', '292', '293', '294', '296', '297', '299', '3', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '32', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '34', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '35', '351', '352', '353', '354', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '37', 
'370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '38', '380', '381', '382', '383', '384', '385', '386', '388', '39', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '4', '40', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '41', '410', '411', '412', '413', '414', '415', '416', '417', '419', '42', '420', '422', '423', '424', '425', '426', '427', '428', '429', '43', '430', '431', '433', '434', '435', '436', '437', '438', '439', '44', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '45', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '47', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '48', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '49', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499')
# useVerts <- c(useVerts, '5', '50', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '51', '510', '511', '513', '514', '515', '516', '517', '518', '519', '52', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '53', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '54', '540', '541', '543', '544', '545', '546', '547', '548', '549', '55', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '56', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '57', '570', '571', '573', '575', '576', '577', '578', '579', '58', '580', '581', '582', '583', '584', '585', '587', '588', '589', '59', '590', '591', '592', '593', '594', '595', '596', '597', '598', '599', '6', '60', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', '61', '610', '611', '612', '613', '614', '615', '617', '618', '619', '62', '620', '621', '622', '623', '624', '625', '626', '627', '628', '629', '63', '630', '631', '632', '633', '634', '635', '636', '637', '638', '64', '640', '641', '642', '643', '644', '645', '646', '647', '648', '649', '65', '650', '651', '652', '653', '654', '655', '656', '657', '658', '659', '66', '660', '661', '662', '663', '664', '665', '666', '667', '668', '669', '67', '670', '671', '672', '673', '674', '675', '676', '677', '678', '679', '68', '680', '681', '682', '683', '684', '685', '686', '687', '688', '689', '69', '690', '691', '692', '694', '695', '696', '697', '698', '699', '7', '700', '701', '702', '703', '704', '705', '706', '707', '708', '709', '71', '711', '713', '714', '715', '716', '717', '718', '719', '72', '720', '721', '722', '723', '724', '725', '726', '728', '729', '73', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '74', '740', '741', '742', '743', '744', '745', '746', '748', '749', '75', '751', '752', '753', '754', '755', '756', '757', '758', '759', '76', '760', '761', '762', '763', '764', '765', '766', '767', '768', '769', 
'77', '770', '771', '772', '773', '774', '775', '776', '777', '778', '779', '78', '780', '781', '782', '783', '784', '785', '786', '787', '788', '789', '79', '790', '791', '792', '793', '794', '795', '796', '797', '798', '799', '8', '80', '800', '801', '802', '803', '804', '805', '806', '807', '808', '809', '81', '810', '811', '812', '813', '814', '816', '817', '818', '819', '82', '820', '821', '822', '823', '824', '825', '826', '827', '828', '829', '83', '830', '831', '832', '833', '834', '835', '836', '837', '838', '839', '84', '840', '842', '843', '844', '845', '846', '847', '849', '85', '850', '851', '852', '853', '854', '855', '856', '857', '858', '859', '86', '860', '861', '862', '863', '864', '865', '866', '867', '868', '869', '87', '870', '871', '872', '873', '874', '875', '876', '877', '878', '879', '88', '880', '881', '882', '883', '884', '885', '886', '887', '888', '889', '89', '890', '891', '892', '894', '895', '896', '897', '898', '90', '901', '902', '903', '904', '905', '906', '907', '908', '909', '91', '910', '911', '912', '913', '914', '915', '916', '917', '918', '919', '92', '920', '921', '922', '923', '924', '925', '926', '927', '928', '929', '93', '930', '931', '932', '933', '934', '935', '936', '937', '938', '939', '94', '940', '941', '942', '943', '944', '945', '946', '947', '948', '949', '95', '950', '951', '952', '953', '954', '955', '956', '957', '958', '959', '96', '960', '961', '963', '964', '965', '966', '967', '968', '969', '97', '970', '971', '972', '973', '974', '975', '976', '977', '979', '98', '980', '981', '982', '983', '984', '985', '986', '987', '988', '989', '99', '990', '991', '992', '993', '994', '995', '996', '997', '998', '999')

# useVertNums <- match(useVerts, V(network))
# useNetwork <- induced_subgraph(network, useVertNums)
useNetwork <- network
useNetwork
## IGRAPH d63def4 UN-- 956 1663 -- 
## + attr: name (v/c), churn (v/n)
## + edges from d63def4 (vertex names):
##  [1] 1 --250 1 --308 1 --413 1 --525 1 --803 1 --894 2 --332 2 --433
##  [9] 2 --474 2 --847 2 --963 2 --968 3 --147 3 --290 3 --337 3 --393
## [17] 3 --474 4 --179 4 --193 4 --233 5 --737 5 --793 5 --838 6 --684
## [25] 6 --718 7 --237 7 --404 8 --698 8 --724 10--285 10--641 11--86 
## [33] 11--285 11--376 11--689 11--758 11--889 12--145 12--410 12--544
## [41] 13--583 13--835 14--96  14--788 14--924 15--43  15--91  15--446
## [49] 16--181 16--289 16--378 16--406 16--547 16--784 17--189 17--399
## [57] 17--482 17--822 18--262 18--308 18--817 18--832 19--260 19--997
## + ... omitted several edges
# Visualize the network (pretty messy)
plot(useNetwork, vertex.label = NA, edge.label = NA, edge.color = 'black', vertex.size = 2)

# Add a node attribute called color
V(useNetwork)$color <- V(useNetwork)$churn

# Change the color of churners to red and non-churners to white
V(useNetwork)$color <- gsub("1", "red", V(useNetwork)$color) 
V(useNetwork)$color <- gsub("0", "white", V(useNetwork)$color)

# Plot the network (pretty messy)
plot(useNetwork, vertex.label = NA, edge.label = NA, edge.color = 'black', vertex.size = 2)

# Create a subgraph with only churners
churnerNetwork <- induced_subgraph(useNetwork, v = V(useNetwork)[which(V(useNetwork)$churn == 1)])
                    
# Plot the churner network 
plot(churnerNetwork, vertex.label = NA, vertex.size = 2)

ctNeighbors <- function(v) {
    tmp <- V(useNetwork)[neighbors(useNetwork, v, mode="all")]$churn
    c(sum(tmp==0), sum(tmp==1))
}
mtxNeighbors <- sapply(V(useNetwork), FUN=ctNeighbors)
NonChurnNeighbors <- mtxNeighbors[1, ]
ChurnNeighbors <- mtxNeighbors[2, ]


# Compute the churn probabilities
churnProb <- ChurnNeighbors / (ChurnNeighbors + NonChurnNeighbors)

# Find who is most likely to churn
mostLikelyChurners <- which(churnProb == max(churnProb))

# Extract the IDs of the most likely churners
customers$id[mostLikelyChurners]
## [1]  21 729 764 922
# Find churn probability of the 44th customer
churnProb[44]
## 49 
##  0
# Update the churn probabilties and the non-churn probabilities
AdjacencyMatrix <- as_adjacency_matrix(useNetwork)
nNeighbors <- colSums(mtxNeighbors)
churnProb_updated <- as.vector((AdjacencyMatrix %*% churnProb) / nNeighbors)

# Find updated churn probability of the 44th customer
churnProb_updated[44]
## [1] 0.3333333
# Compute the AUC
pROC::auc(churnID, as.vector(churnProb))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5091
# Write a for loop to update the probabilities
for(i in 1:10){
    churnProb <- as.vector((AdjacencyMatrix %*% churnProb) / nNeighbors)
}

# Compute the AUC again
pROC::auc(churnID, as.vector(churnProb))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6753

Chapter 2 - Homophily

Homophily:

  • Social networks tend to have reasons for connections (sharing of common properties such as interests, locations, etc.)
  • Homophily is a scientific term for “birds of a feather flock together” (similar nodes are more likely to connect than dissimilar nodes)
  • Can define different types of edges to classify the degree of homophily
    • edge_rr <- sum(E(g)$label == 'rr') # connects r to r
    • edge_pp <- sum(E(g)$label == 'pp') # connects p to p
    • edge_rp <- sum(E(g)$label == 'rp') # connects r to p or p to r
    • p <- 2 * edges / (nodes * (nodes - 1)) # connectance: fraction of possible edges present

Dyadicity:

  • Dyadicity is a measure of connectedness among nodes with the same attributes
    • Comparison is to a random configuration of the network (1 would be equal to expected, >1 would be greater than expected)
    • Dyadic (>1) - Random (~1) - Anti-Dyadic (<1)

Heterophilicity:

  • Connectedness among nodes with opposite labels
    • Expected number of connections is nA * nB * p (where p is probability of connection across the network)
  • Heterophilic (>1), Random (~1), and Heterophobic (<1) networks are all possible
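The quantities above can be checked in a minimal base-R sketch; the labels and edge list here are a toy example (not the course's customer network), with the same formulas the chapter uses:

```r
# Toy labeled network: 1 = churner, 0 = non-churner
label <- c(a = 1, b = 1, c = 0, d = 0, e = 0)
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "d"), c("c", "d"), c("d", "e"))

nodes <- length(label)
n1    <- sum(label == 1)                           # churn nodes
n0    <- sum(label == 0)                           # non-churn nodes
m     <- nrow(edges)                               # total edges
p     <- 2 * m / (nodes * (nodes - 1))             # connectance

# Edge type: 2 = churn-churn, 1 = mixed, 0 = non-churn-non-churn
edge_type <- label[edges[, 1]] + label[edges[, 2]]

dyadicity       <- sum(edge_type == 2) / (n1 * (n1 - 1) / 2 * p)
heterophilicity <- sum(edge_type == 1) / (n1 * n0 * p)
c(connectance = p, dyadicity = dyadicity, heterophilicity = heterophilicity)
```

Here the one churn-churn edge is twice the expected 0.5, so the toy network is dyadic; the two mixed edges fall short of the expected 3, so it is heterophobic.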

Summary:

  • Need to evaluate whether node attributes have important relationships - is there structure among the node connections?
  • Dyadicity (connectedness to nodes of the same type, relative to expected / average network connectedness)
  • Heterophilicity (connectedness to nodes of a different type, relative to expected / average network connectedness)

Example code includes:

# Add the column edgeList$FromLabel
edgeList$FromLabel <- customers[match(edgeList$from, customers$id), 2]
edgeList$FromLabel[is.na(edgeList$FromLabel)] <- 0

# Add the column edgeList$ToLabel
edgeList$ToLabel <- customers[match(edgeList$to, customers$id), 2]
edgeList$ToLabel[is.na(edgeList$ToLabel)] <- 0

# Add the column edgeList$edgeType
edgeList$edgeType <- edgeList$FromLabel + edgeList$ToLabel
 
# Count the number of each type of edge
table(edgeList$edgeType)
## 
##    0    1    2 
## 1544  117    2
# Count churn edges
ChurnEdges <- sum(edgeList$edgeType == 2)
 
# Count non-churn edges
NonChurnEdges <- sum(edgeList$edgeType == 0)
 
# Count mixed edges
MixedEdges <- sum(edgeList$edgeType == 1)
 
# Count all edges
edges <- ChurnEdges + NonChurnEdges + MixedEdges

# Print the number of edges
edges
## [1] 1663
# Count the number of churn nodes
ChurnNodes <- sum(customers$churn == 1)
 
# Count the number of non-churn nodes
NonChurnNodes <- sum(customers$churn == 0)
 
# Count the total number of nodes
nodes <- ChurnNodes + NonChurnNodes
 
# Compute the network connectance
connectance <- 2 * edges / nodes / (nodes - 1)

# Print the value
connectance
## [1] 0.003643015
# Compute the expected churn dyadicity
ExpectedDyadChurn <- ChurnNodes * (ChurnNodes - 1) * connectance / 2
 
# Compute the churn dyadicity
DyadChurn <- ChurnEdges / ExpectedDyadChurn
 
# Inspect the value
DyadChurn
## [1] 1.262059
# Compute the expected heterophilicity
ExpectedHet <- NonChurnNodes * ChurnNodes * connectance
 
# Compute the heterophilicity
Het <- MixedEdges / ExpectedHet
 
# Inspect the heterophilicity
Het
## [1] 1.156093

Chapter 3 - Network Featurization

Basic Network Features:

  • Provided that the labels of the nodes depend on each other, they can be useful for predictive modeling
  • Can begin by getting neighborhood features
    • degree(g) # first-order degree
    • neighborhood.size(g, order=2) # second-order degree: the number of nodes within two steps
    • count_triangles(g) # triangles that each node is part of (a triangle is a full connection among three nodes)
  • Can also look at the centrality features
    • Betweenness - measure of how often the node lies on the shortest path between two other nodes
    • betweenness(g)
    • Closeness - measure of how easily a node can reach the other nodes
    • closeness(g)
  • Can also look at the transitivity (clustering coefficient)
    • transitivity(g, type = 'local') # computed for each node separately
    • Triangles vs. Triangles + Triads (triangles with a missing edge)
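What the local clustering coefficient measures can be verified by hand; a base-R toy sketch (not igraph itself), where each node's triangles are divided by its possible neighbor pairs:

```r
# 4-node toy graph with edges 1-2, 1-3, 2-3, 3-4 (one triangle: 1-2-3)
A <- matrix(0, 4, 4)
A[1, 2] <- A[1, 3] <- A[2, 3] <- A[3, 4] <- 1
A <- A + t(A)

deg       <- rowSums(A)
triangles <- diag(A %*% A %*% A) / 2      # closed 3-paths through each node, halved
local_cc  <- ifelse(deg > 1, triangles / (deg * (deg - 1) / 2), 0)
local_cc                                   # matches transitivity(g, type='local', isolates='zero')
```

Node 3 sits in one triangle but has three neighbors (three possible pairs), so its coefficient is 1/3; the degree-1 node 4 is set to zero, mirroring the isolates='zero' option.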

Link-Based Features:

  • Link-based features are network features that depend on adjacencies
    • The adjacency matrix has all nodes on both the x and y axis, with a 1 meaning they are linked by an edge
    • A <- get.adjacency(g)
    • Can then run the matrix multiplication (dot product) against (for example) a vector indicating preferences
    • preference <- c(1,1,1,1,1,1,0,0,0,0)
    • rNeighbors <- A %*% preference
    • as.vector(rNeighbors)
  • Could instead get something like the average age of the neighbors
    • age <- c(23,65,33,36,28,45,41,24,38,39)
    • degree <- degree(g)
    • averageAge <- A %*% age / degree

Page Rank:

  • The page rank is the basis of search algorithms such as Google
  • The page rank algorithm assumes that clicks on one page lead to the next page, and checks the degree of connectivity
    • Page rank of pages that link to it
    • Number of links that the referring pages include
  • The page rank formula for all pages simultaneously is PR = alpha * A * PR + (1 - alpha) * epsilon
    • The alpha is the probability of a link being clicked rather than jumping to a random page (the damping factor), with 0.85 as the default
    • The A is the adjacency matrix
    • The PR is the page rank, and is solved with either matrix inversions or iterations
    • page.rank(g)
    • page.rank(g, personalized = c(1,0,0,0,0,0,0,0,0,0)) # a personalized argument where the first node connections drive higher values
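The update rule above can be sketched in base R with power iteration on a toy graph; note the adjacency matrix is column-normalized by degree so that the scores stay a probability distribution (igraph's page.rank handles this internally):

```r
# Toy undirected graph: edges 1-2, 1-3, 2-3, 3-4
A <- matrix(0, 4, 4)
A[1, 2] <- A[1, 3] <- A[2, 3] <- A[3, 4] <- 1
A <- A + t(A)
M <- sweep(A, 2, colSums(A), "/")         # each column sums to 1

alpha <- 0.85                             # default damping factor
n     <- nrow(A)
pr    <- rep(1 / n, n)                    # start from a uniform distribution
for (i in 1:50) pr <- alpha * (M %*% pr) + (1 - alpha) / n
round(as.vector(pr), 4)
```

The scores sum to 1 after every update, and the best-connected node (node 3, with degree 3) ends up with the largest score.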

Example code includes:

# Extract network degree
V(network)$degree <- degree(network, normalized=TRUE)

# Extract second-order network degree
degree2 <- neighborhood.size(network, 2)

# Normalize second-order network degree
V(network)$degree2 <- degree2 / (length(V(network)) - 1)

# Extract number of triangles
V(network)$triangles <- count_triangles(network)


# Extract the betweenness
V(network)$betweenness <- betweenness(network, normalized=TRUE)

# Extract the closeness
V(network)$closeness <- closeness(network, normalized=TRUE)
## Warning in closeness(network, normalized = TRUE): At centrality.c:
## 2784 :closeness centrality is not well-defined for disconnected graphs
# Extract the eigenvector centrality
V(network)$eigenCentrality <- eigen_centrality(network, scale = TRUE)$vector


# Extract the local transitivity
V(network)$transitivity <- transitivity(network, type="local", isolates='zero')

# Compute the network's transitivity
transitivity(network)
## [1] 0.1002653
# Extract the adjacency matrix
AdjacencyMatrix <- as_adjacency_matrix(network)

# Compute the second order matrix
SecondOrderMatrix <- AdjacencyMatrix %*% AdjacencyMatrix

# Adjust the second order matrix
SecondOrderMatrix_adj <- ((SecondOrderMatrix) > 0) + 0
diag(SecondOrderMatrix_adj) <- 0

# Inspect the second order matrix
SecondOrderMatrix_adj[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names '1', '2', '3' ... ]]
##                       
## 1  0 . . . . . . . . .
## 2  . 0 1 . . . . . . .
## 3  . 1 0 . . . . . . .
## 4  . . . 0 . . . . . .
## 5  . . . . 0 . . . . .
## 6  . . . . . 0 . . . .
## 7  . . . . . . 0 . . .
## 8  . . . . . . . 0 . .
## 10 . . . . . . . . 0 1
## 11 . . . . . . . . 1 0
# Compute the number of churn neighbors
V(network)$ChurnNeighbors <- as.vector(AdjacencyMatrix %*% V(network)$churn)

# Compute the number of non-churn neighbors
V(network)$NonChurnNeighbors <- as.vector(AdjacencyMatrix %*% (1 - V(network)$churn))

# Compute the relational neighbor probability
V(network)$RelationalNeighbor <- as.vector(V(network)$ChurnNeighbors / 
    (V(network)$ChurnNeighbors + V(network)$NonChurnNeighbors))


# Compute the number of churners in the second order neighborhood
V(network)$ChurnNeighbors2 <- as.vector(SecondOrderMatrix %*% V(network)$churn)

# Compute the number of non-churners in the second order neighborhood
V(network)$NonChurnNeighbors2 <- as.vector(SecondOrderMatrix %*% (1 - V(network)$churn))

# Compute the relational neighbor probability in the second order neighborhood
V(network)$RelationalNeighbor2 <- as.vector(V(network)$ChurnNeighbors2 / 
    (V(network)$ChurnNeighbors2 + V(network)$NonChurnNeighbors2))


degree <- degree(network)

# Extract the average degree of neighboring nodes
V(network)$averageDegree <- 
    as.vector(AdjacencyMatrix %*% V(network)$degree) / degree

# Extract the average number of triangles of neighboring nodes
V(network)$averageTriangles <- 
    as.vector(AdjacencyMatrix %*% V(network)$triangles) / degree

# Extract the average transitivity of neighboring nodes    
V(network)$averageTransitivity <-
    as.vector(AdjacencyMatrix %*% V(network)$transitivity) / degree

# Extract the average betweeness of neighboring nodes    
V(network)$averageBetweenness <- 
    as.vector(AdjacencyMatrix %*% V(network)$betweenness) / degree


# Compute one iteration of PageRank 
# iter1 <- page.rank(network, algo = 'power', options = list(niter = 1))$vector

# Compute two iterations of PageRank 
# iter2 <- page.rank(network, algo = 'power', options = list(niter = 2))$vector

# Inspect the change between one and two iterations
# sum(abs(iter1 - iter2))

# Inspect the change between nine and ten iterations
# sum(abs(iter9 - iter10))


# Create an empty vector
# value <- c()

# Write a loop to compute PageRank 
# for(i in 1:15){
#   value <- cbind(value, page.rank(network, algo = 'power',options = list(niter = i))$vector)
# }
  
# Compute the differences 
# difference <- colSums(abs(value[,1:14] - value[,2:15]))

# Plot the differences
# plot(1:14, difference)


# boxplots <- function(damping=0.85, personalized=FALSE){
#   if(personalized){
#     V(network)$pp<-page.rank(network,damping=damping,personalized = V(network)$Churn)$vector
#   }
#   else{
#   V(network)$pp<-page.rank(network,damping=damping)$vector
#   }
#   boxplot(V(network)$pp~V(network)$Churn)#
# }

# Look at the distribution of standard PageRank scores
# boxplots(damping = 0.85)

# Inspect the distribution of personalized PageRank scores
# boxplots(damping = 0.85, personalized = TRUE)

# Look at the standard PageRank with damping factor 0.2
# boxplots(damping = 0.2)

# Inspect the personalized PageRank scores with a damping factor 0.99
# boxplots(damping=0.99, personalized = TRUE)


# Compute the default PageRank score
# V(network)$pr_0.85 <- page.rank(network)$vector

# Compute the PageRank score with damping 0.2
# V(network)$pr_0.20 <- page.rank(network, damping=0.2)$vector

# Compute the personalized PageRank score
# V(network)$perspr_0.85 <- page.rank(network, damping=0.85, personalized = V(network)$Churn)$vector

# Compute the personalized PageRank score with damping 0.99
# V(network)$perspr_0.99 <- page.rank(network, damping=0.99, personalized = V(network)$Churn)$vector

Chapter 4 - Putting It All Together

Extract Dataset:

  • May want to extract some of the features from the nodes in the igraph
    • g # prints the object
    • as_data_frame(g, what = 'vertices') # extracts to a data frame - what can be 'vertices' (nodes) or 'edges'
  • May want to preprocess, particularly for missing values (non-disclosed, non-relevant, processing error, etc.)
    • sum(is.na(dataset$degree))
  • May want to understand the correlations among the node features
    • M <- cor(dataset[,-1])
    • corrplot::corrplot(M, method = 'circle')

Building Predictive Models:

  • Building a predictive model using supervised learning
  • Can split the dataset from the graph in to a test and train dataset
  • Can use either logistic regression or random forests
    • glm(R ~ degree + pageRank, data = training_set, family = 'binomial')
    • library(randomForest)
    • rfModel <- randomForest(R ~ ., data = training_set)
    • varImpPlot(rfModel)

Evaluating Model Performance:

  • Can evaluate the model using the test dataset, along with the functions in the pROC library
    • logPredictions <- predict(logModel, newdata = test_set, type = 'response') # probability based on the logistic regression
    • rfPredictions <- predict(rfModel, newdata = test_set, type = 'prob') # two-column matrix, one for the probability of each factor level (two in this case)
  • Can evaluate the model based on AUC which is often between 0.5 (random) and 1.0 (perfect)
    • library(pROC)
    • auc(test_set$label, logPredictions)
  • Can also assess “top decile” lift - looks at actual churn among the top-10% churn probabilities
    • library(lift)
    • TopDecileLift(test_set$label, predictions, plot=TRUE)
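Top-decile lift can be computed by hand to see what the metric means; `top_decile_lift` below is a hypothetical helper mirroring the idea behind lift::TopDecileLift, not the package's own function:

```r
# Lift = churn rate among the 10% highest-scored customers / overall churn rate
top_decile_lift <- function(labels, scores) {
  cutoff <- ceiling(0.1 * length(scores))
  top    <- order(scores, decreasing = TRUE)[1:cutoff]
  mean(labels[top]) / mean(labels)
}

set.seed(1)
labels <- rbinom(100, 1, 0.2)             # simulated churn outcomes
scores <- runif(100) + 0.5 * labels       # scores mildly informative of the label
top_decile_lift(labels, scores)           # > 1 means better than random targeting
```

A value of 1 means the top decile churns at the base rate (random targeting); a perfect model on a 10% base rate would reach a lift of 10.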

Wrap Up:

  • Accurately predicting labels for network data
  • Labeled networks that convert edgelists to networks using igraph
  • Homophily - idea that “birds of a feather flock together”
    • Dyadicity - connectedness of nodes of same labels
    • Heterophilicity - connectedness of nodes of different labels
  • Can featurize the network using igraph
  • Can create datasets based on networks, then use that for modeling
    • dataset <- as_data_frame(g, what = 'vertices')
    • glm(R ~ ., data = training_set, family = 'binomial')
    • logPredictions <- predict(logModel, newdata = test_set, type = 'response')
    • auc(test_set$label, logPredictions)
    • TopDecileLift(test_set$label, predictions, plot=TRUE)

Example code includes:

# Extract the dataset
dataset_full <- as_data_frame(network, what = "vertices")
dataset_full$Future <- 0
dsF1 <- c(404, 550, 41, 613, 48, 230, 294, 852, 93, 520, 617, 523, 714, 282, 705, 153, 995, 511, 204, 273, 194, 756, 979, 879, 843, 713, 837, 636, 469, 478, 938, 654, 751, 775)
dataset_full[match(dsF1, dataset_full$name), "Future"] <- 1

# Inspect the dataset
head(dataset_full)
##   name churn      degree     degree2 triangles betweenness  closeness
## 1    1     0 0.006282723 0.021989529         2 0.008143888 0.08535931
## 2    2     0 0.006282723 0.030366492         0 0.019841794 0.08965452
## 3    3     0 0.005235602 0.019895288         0 0.008281176 0.08587357
## 4    4     0 0.003141361 0.012565445         0 0.003340597 0.08377928
## 5    5     0 0.003141361 0.008376963         1 0.002081070 0.07731541
## 6    6     0 0.002094241 0.010471204         0 0.001766448 0.08158209
##   eigenCentrality transitivity ChurnNeighbors NonChurnNeighbors
## 1      0.16674956    0.1333333              0                 6
## 2      0.25675402    0.0000000              0                 6
## 3      0.10174163    0.0000000              0                 5
## 4      0.06977430    0.0000000              0                 3
## 5      0.01838209    0.3333333              0                 3
## 6      0.06214509    0.0000000              0                 2
##   RelationalNeighbor ChurnNeighbors2 NonChurnNeighbors2
## 1                  0               3                 22
## 2                  0               0                 28
## 3                  0               2                 16
## 4                  0               0                 11
## 5                  0               0                  9
## 6                  0               0                  9
##   RelationalNeighbor2 averageDegree averageTriangles averageTransitivity
## 1           0.1200000   0.004363002        0.8333333          0.13888889
## 2           0.0000000   0.004886562        0.5000000          0.03888889
## 3           0.1111111   0.003769634        0.6000000          0.08666667
## 4           0.0000000   0.003839442        0.0000000          0.00000000
## 5           0.0000000   0.003141361        0.6666667          0.11111111
## 6           0.0000000   0.004712042        1.0000000          0.13333333
##   averageBetweenness Future
## 1        0.005713676      0
## 2        0.013126033      0
## 3        0.005525569      0
## 4        0.005168532      0
## 5        0.003089281      0
## 6        0.005718188      0
# Remove customers who already churned
dataset_filtered <- dataset_full[-which(dataset_full$churn == 1), ]

# Remove useless columns
dataset <- dataset_filtered[, -c(1, 2)]


# Inspect the feature
summary(dataset$RelationalNeighbor2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02509 0.04000 0.33333
# Find the indices of the missing values
toReplace <- which(is.na(dataset$RelationalNeighbor2))

# Replace the missing values with 0
dataset$RelationalNeighbor2[toReplace] <- 0

# Inspect the feature again
summary(dataset$RelationalNeighbor2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02509 0.04000 0.33333
# Generate the correlation matrix
M <- cor(dataset[,])

# Plot the correlations
corrplot::corrplot(M, method = "circle")

# Print the column names
colnames(dataset)
##  [1] "degree"              "degree2"             "triangles"          
##  [4] "betweenness"         "closeness"           "eigenCentrality"    
##  [7] "transitivity"        "ChurnNeighbors"      "NonChurnNeighbors"  
## [10] "RelationalNeighbor"  "ChurnNeighbors2"     "NonChurnNeighbors2" 
## [13] "RelationalNeighbor2" "averageDegree"       "averageTriangles"   
## [16] "averageTransitivity" "averageBetweenness"  "Future"
# Create toRemove
# toRemove <- c(10, 13, 19, 22)

# Remove the columns
# dataset <- dataset[, -toRemove]


# Set the seed
set.seed(7)

# Create the index vector
index_train <- sample(1:nrow(dataset), round((2/3) * nrow(dataset), 0), replace=FALSE)

# Make the training set
training_set <- dataset[index_train,]

# Make the test set
test_set <- dataset[-index_train,]


# Make firstModel
firstModel <- glm(Future ~ degree + degree2 + triangles + betweenness + closeness + transitivity, 
                  family = "binomial", data = training_set
                  )

# Build the model
secondModel <- glm(Future ~ ChurnNeighbors + RelationalNeighbor + ChurnNeighbors2 + RelationalNeighbor2 + averageDegree + averageTriangles + averageTransitivity + averageBetweenness, 
                   family = "binomial", data = training_set
                   )

# Build the model
thirdModel <- glm(Future ~ ., data=training_set, family="binomial")


# Set seed
set.seed(863)

# Build model
rfModel <- randomForest::randomForest(as.factor(Future)~. ,data=training_set)

# Plot variable importance
randomForest::varImpPlot(rfModel)

# Predict with the first model
firstPredictions <- predict(firstModel, newdata = test_set, type = "response")

# Predict with the second model
secondPredictions <- predict(secondModel, newdata = test_set, type = "response")

# Predict with the third model
thirdPredictions <- predict(thirdModel, newdata = test_set, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
# Predict with the random forest model
rfPredictions<- predict(rfModel, newdata = test_set, type = "prob")

sapply(list(firstPredictions, secondPredictions, thirdPredictions, rfPredictions[, 2]), 
       FUN=function(x) { pROC::auc(test_set$Future, x) }
       )
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] 0.7321538 0.6514643 0.7031727 0.4804759

Bayesian Regression Modeling with rstanarm

Chapter 1 - Introduction to Bayesian Linear Models

Non-Bayesian Linear Regression:

  • Can use the “kidiq” dataset from the rstanarm package
  • Objective is to predict the child IQ from the mom IQ
    • lm_model <- lm(kid_score ~ mom_iq, data = kidiq)
    • summary(lm_model)
    • broom::tidy(lm_model)
  • Challenge of the p-value is that it is only a comparison to the null hypothesis
  • Often interested in understanding the underlying population, as well as the statistic calculated
  • Can use the “songs” dataset from Spotify

Bayesian Linear Regression:

  • Bayesian methods sample from the posterior distribution, allowing for inferences about values the parameters might take
  • The “rstanarm” package allows a high-level interface to the Stan library
    • library(rstanarm)
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • summary(stan_model)
  • There are several descriptive statistics in the output
    • sigma: Standard deviation of errors
    • mean_PPD: mean of posterior predictive samples
    • log-posterior: analogous to a likelihood
    • Rhat: a measure of within chain variance compared to across chain variance (stability of estimates, with a goal of less than 1.1 for all parameters)

Comparing frequentist and Bayesian models:

  • The Bayesian and frequentist methods produce very similar outputs
  • The fundamental difference is that frequentists assume fixed parameters and random data while Bayesians assume fixed data and random parameters
    • Frequentist - the p-value is the probability of a test statistics, given a specific null hypothesis
    • Bayesian - probabilities of parameters vary in their ability to generate the given data
  • Bayesian approaches use the credible interval, which is very similar to the confidence interval
    • Confidence interval: Probability that a range contains the true value
    • Credible interval: Probability that the true value is within a range
    • posterior_interval(stan_model) # gives the credible intervals (90% by default)
    • posterior_interval(stan_model, prob = 0.95) # gives the 95% credible intervals
  • Can look at both the confidence intervals and the credible intervals for the same dataset and different models
    • confint(lm_model, parm = "mom_iq", level = 0.95)
    • posterior_interval(stan_model, pars = "mom_iq", prob = 0.95)
  • Can also look at probabilities that a parameter is between several key points
    • posterior <- spread_draws(stan_model, mom_iq)
    • mean(between(posterior$mom_iq, 0.60, 0.65)) # Bayesian methods allow for actual inferences about the parameter
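Because credible intervals and parameter probabilities are just summaries of posterior draws, the mechanics can be shown in base R; the draws here are simulated stand-ins (rstanarm would supply real draws for a parameter like mom_iq):

```r
# Simulated stand-in for 4,000 posterior draws of a slope parameter
set.seed(42)
draws <- rnorm(4000, mean = 0.61, sd = 0.06)

quantile(draws, probs = c(0.05, 0.95))         # 90% credible interval (like posterior_interval)
mean(draws > 0.5)                              # P(parameter > 0.5 | data)
mean(draws >= 0.60 & draws <= 0.65)            # P(0.60 <= parameter <= 0.65 | data)
```

This is the key practical difference from a confidence interval: with draws in hand, any probability statement about the parameter is a one-line proportion.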

Example code includes:

# Print the first 6 rows
head(songs)

# Print the structure
str(songs)


# Create the model here
lm_model <- lm(popularity ~ song_age, data = songs)

# Produce the summary
summary(lm_model)

# Print a tidy summary of the coefficients
tidy(lm_model)


# Create the model here
stan_model <- stan_glm(popularity ~ song_age, data = songs)

# Produce the summary
summary(stan_model)

# Print a tidy summary of the coefficients
tidy(stan_model)


# Create the 90% credible intervals
posterior_interval(stan_model)

# Create the 95% credible intervals
posterior_interval(stan_model, prob = 0.95)

# Create the 80% credible intervals
posterior_interval(stan_model, prob = 0.8)

Chapter 2 - Modifying a Bayesian Model

What is in a Bayesian Model?

  • Many levers for modifying a Bayesian model
  • Posterior distributions are sampled in independent sequences of iterations called chains, each of which begins at a random starting point
    • Chains move toward regions where there is a good combination of the priors and the parameters
    • Longer chains allow for more robust parameter estimates
  • Convergence is important, since it ensures a consistent output - chains start at different places, so the starting iterations (prior to convergence) are discarded
    • Each chain begins with a few thousand warm-up (burn-in) iterations
  • The number of iterations is a balancing act - not enough (convergence problems) or too many (run-time problems)

Prior Distributions:

  • Priors can inform the posterior distribution, given the same data in the experiment
    • More informative priors typically have narrower distributions
    • Can think of the prior as being like an additional data point - more meaningful the less data that we have
    • Generally a best practice to use a weakly informative prior, absent good cause for a strong belief
  • Example of including prior distributions in rstanarm
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • prior_summary(stan_model)
  • Can calculate adjusted scales for the intercept as 10 * sd(y), since 10 is the default for the package
  • Can calculate adjusted scales for the coefficients as (2.5 / sd(x)) * sd(y), since 2.5 is the default for the package
    • prior_summary(stan_model)
  • Can also use unadjusted priors, such as
    • no_scale <- stan_glm(kid_score ~ mom_iq, data = kidiq, prior_intercept = normal(autoscale = FALSE), prior = normal(autoscale = FALSE), prior_aux = exponential(autoscale = FALSE) )
    • prior_summary(no_scale)

User-Specified Priors:

  • Can use a specified prior distribution using the same arguments as in the previous paragraph
    • Research may suggest the parameter should be near a specific value - take advantage of narrow prior
    • Parameters may be constrained, such as a need to be always positive
  • Can specify the priors to be cast to specific values
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq, prior_intercept = normal(location = 0, scale = 10), prior = normal(location = 0, scale = 2.5), prior_aux = exponential(rate = 1) )
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq, prior_intercept = normal(location = 0, scale = 10, autoscale = FALSE), prior = normal(location = 0, scale = 2.5, autoscale = FALSE), prior_aux = exponential(rate = 1, autoscale = FALSE) )
    • The autoscale=FALSE is needed so that stan_glm does not rescale the data before applying the priors
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq, prior_intercept = normal(location = 3, scale = 2), prior = cauchy(location = 0, scale = 1) )
  • There are many types of prior distributions that can be used
    • ?priors
  • Can also set a flat prior (prior provides zero information)
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq, prior_intercept = NULL, prior = NULL, prior_aux = NULL )
    • This is usually a bad idea; rarely in a state where there is zero information, and a weakly informative prior with an adjusted scale is typically better

Tuning Models for Stability:

  • May need to alter estimation parameters
  • Divergent transitions are when the steps are too big - need to take smaller steps (longer run time)
    • stan_model <- stan_glm(popularity ~ song_age, data = songs, control = list(adapt_delta = 0.95)) # 0.95 is default, increase will decrease step size
  • May have already reached the maximum tree depth in some of the chains (indicates poor efficiency; no good stopping place, and insufficient sampling of the posterior)
    • stan_model <- stan_glm(popularity ~ song_age, data = songs, control = list(max_treedepth = 10)) # 10 is the default, increases allow more sampling

Example code includes:

# 3 chains, 1000 iterations, 500 warmup
model_3chains <- stan_glm(popularity ~ song_age, data = songs,
    chains = 3, iter = 1000, warmup = 500)

# Print a summary of model_3chains
summary(model_3chains)

# 2 chains, 100 iterations, 50 warmup
model_2chains <- stan_glm(popularity ~ song_age, data = songs,
    chains = 2, iter = 100, warmup = 50)

# Print a summary of model_2chains
summary(model_2chains)


# Estimate the model
stan_model <- stan_glm(popularity ~ song_age, data = songs)

# Print a summary of the prior distributions
prior_summary(stan_model)


# Calculate the adjusted scale for the intercept
10 * sd(songs$popularity)

# Calculate the adjusted scale for `song_age`
(2.5 / sd(songs$song_age)) * sd(songs$popularity)

# Calculate the adjusted scale for `valence`
(2.5 / sd(songs$valence)) * sd(songs$popularity)


# Estimate the model with unadjusted scales
no_scale <- stan_glm(popularity ~ song_age, data = songs,
    prior_intercept = normal(autoscale = FALSE),
    prior = normal(autoscale = FALSE),
    prior_aux = exponential(autoscale = FALSE)
)

# Print the prior summary
prior_summary(no_scale)


# Estimate a model with flat priors
flat_prior <- stan_glm(popularity ~ song_age, data = songs,
    prior_intercept = NULL, prior = NULL, prior_aux = NULL)

# Print a prior summary
prior_summary(flat_prior)


# Estimate the model with an informative prior
inform_prior <- stan_glm(popularity ~ song_age, data = songs,
    prior = normal(location = 20, scale = 0.1, autoscale = FALSE))

# Print the prior summary
prior_summary(inform_prior)


# Estimate the model with a new `adapt_delta`
adapt_model <- stan_glm(popularity ~ song_age, data = songs,
  control = list(adapt_delta = 0.99))

# View summary
summary(adapt_model)

# Estimate the model with a new `max_treedepth`
tree_model <- stan_glm(popularity ~ song_age, data = songs,
  control = list(max_treedepth = 15))

# View summary
summary(tree_model)

Chapter 3 - Assessing Model Fit

Using R-Squared Statistics:

  • The R-squared statistic is a measure of how well the model predicts the dependent variable (proportion of variance explained) - “coefficient of determination”
    • R-squared = 1 - RSS/TSS where RSS is residual sum-squares and TSS is total sum-squares
    • lm_model <- lm(kid_score ~ mom_iq, data = kidiq)
    • lm_summary <- summary(lm_model)
    • lm_summary$r.squared
    • ss_res <- var(residuals(lm_model))
    • ss_total <- var(residuals(lm_model)) + var(fitted(lm_model))
    • 1 - (ss_res / ss_total)
  • The R-squared is not saved by the stan_glm() call, but can be calculated
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • ss_res <- var(residuals(stan_model))
    • ss_total <- var(fitted(stan_model)) + var(residuals(stan_model))
    • 1 - (ss_res / ss_total)

Posterior Predictive Model Checks:

  • Can use posterior distributions to check the model
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • spread_draws(stan_model, `(Intercept)`, mom_iq) %>% select(-.draw)
    • predictions <- posterior_linpred(stan_model) # predicted scores using each of the sets of parameter values and each data point
  • Can compare the distributions of predicted and observed scores
    • iter1 <- predictions[1,]
    • iter2 <- predictions[2,]
    • summary(kidiq$kid_score)
    • summary(iter1)
    • summary(iter2)

Model Fit with Posterior Predictive Model Checks:

  • Can use Bayesian functions as part of the post-processing
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • r2_posterior <- bayes_R2(stan_model)
    • summary(r2_posterior)
    • quantile(r2_posterior, probs = c(0.025, 0.975))
    • hist(r2_posterior)
    • pp_check(stan_model, "dens_overlay") # compare densities
    • pp_check(stan_model, "stat") # compare statistic to histogram
  • The mean is only one aspect - can look at many aspects of the dependent variable
    • pp_check(stan_model, "stat_2d") # mean and sd plotted on a 2-D plot

Bayesian Model Comparisons:

  • Can compare two or more models produced using rstanarm
  • The LOO (leave one out) package runs a modified (approximated) form of LOO cross-validation
    • library(loo)
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • loo(stan_model)
  • The values from LOO are mainly useful in comparison to similar models (much like ANOVA for comparing nested linear regression models)
    • model_1pred <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • model_2pred <- stan_glm(kid_score ~ mom_iq * mom_hs, data = kidiq)
    • loo_1pred <- loo(model_1pred)
    • loo_2pred <- loo(model_2pred)
    • compare(loo_1pred, loo_2pred)
  • The compare() function provides the difference in loo, along with an SE
    • Positive means “prefer second model”
    • Negative means “prefer first model”
    • The SE helps assess whether it is meaningful - rule of thumb is to require change in loo greater than SE, otherwise prefer the simpler model
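The comparison step above can be sketched as follows (compare() as used in the course; newer versions of the loo package rename this to loo_compare()):

```r
library(loo)

# Assuming loo_1pred and loo_2pred were created from two fitted models
comparison <- compare(loo_1pred, loo_2pred)

# The result reports elpd_diff and its SE: positive prefers the second
# model, negative the first, and the difference is only meaningful
# when it exceeds its SE
print(comparison)
```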

Example code includes:

# Print the R-squared from the linear model
lm_summary$r.squared

# Calculate sums of squares
ss_res <- var(residuals(lm_model))
ss_fit <- var(fitted(lm_model))

# Calculate the R-squared
1 - (ss_res / (ss_res + ss_fit))


# Save the variance of residuals
ss_res <- var(residuals(stan_model))

# Save the variance of fitted values
ss_fit <- var(fitted(stan_model))

# Calculate the R-squared
1 - (ss_res / (ss_res + ss_fit))


# Calculate posterior predictive scores
predictions <- posterior_linpred(stan_model)

# Print a summary of the observed data
summary(songs$popularity)

# Print a summary of the 1st replication
summary(predictions[1,])

# Print a summary of the 10th replication
summary(predictions[10,])


# Calculate the posterior distribution of the R-squared
r2_posterior <- bayes_R2(stan_model)

# Make a histogram of the distribution
hist(r2_posterior)


# Create density comparison
pp_check(stan_model, "dens_overlay")

# Create scatter plot of means and standard deviations
pp_check(stan_model, "stat_2d")


# Estimate the model with 1 predictor
model_1pred <- stan_glm(popularity ~ song_age, data = songs)

# Print the LOO estimate for the 1 predictor model
loo(model_1pred)

# Estimate the model with both predictors
model_2pred <- stan_glm(popularity ~ song_age * artist_name, data = songs)

# Print the LOO estimates for the 2 predictor model
loo(model_2pred)

Chapter 4 - Presenting and Using Bayesian Regression

Visualizing Bayesian Models:

  • Can save the model coefficients using tidy() and then plot the regression against the underlying point
    • stan_model <- stan_glm(kid_score ~ mom_iq, data = kidiq)
    • tidy(stan_model)
    • tidy_coef <- tidy(stan_model)
    • model_intercept <- tidy_coef$estimate[1]
    • model_slope <- tidy_coef$estimate[2]
    • ggplot(kidiq, aes(x = mom_iq, y = kid_score)) + geom_point() + geom_abline(intercept = model_intercept, slope = model_slope)
  • Can also plot uncertainty using the posterior distribution
    • draws <- spread_draws(stan_model, `(Intercept)`, mom_iq)
    • ggplot(kidiq, aes(x = mom_iq, y = kid_score)) + geom_point()
    • ggplot(kidiq, aes(x = mom_iq, y = kid_score)) + geom_point() + geom_abline(data = draws, aes(intercept = `(Intercept)`, slope = mom_iq), size = 0.2, alpha = 0.1, color = "skyblue")
    • ggplot(kidiq, aes(x = mom_iq, y = kid_score)) + geom_point() + geom_abline(data = draws, aes(intercept = `(Intercept)`, slope = mom_iq), size = 0.2, alpha = 0.1, color = "skyblue") + geom_abline(intercept = model_intercept, slope = model_slope)

Making Predictions:

  • Can make predictions for the observed data
    • stan_model <- stan_glm(kid_score ~ mom_iq + mom_hs, data = kidiq)
    • posteriors <- posterior_predict(stan_model)
    • predict_data <- data.frame( mom_iq = 110, mom_hs = c(0, 1) )
    • new_predictions <- posterior_predict(stan_model, newdata = predict_data)

Visualizing Predictions:

  • Can plot predictions based on the new data
    • stan_model <- stan_glm(kid_score ~ mom_iq + mom_hs, data = kidiq)
    • predict_data <- data.frame( mom_iq = 110, mom_hs = c(0, 1) )
    • posterior <- posterior_predict(stan_model, newdata = predict_data)
    • posterior <- as.data.frame(posterior)
    • colnames(posterior) <- c("No HS", "Completed HS")
    • plot_posterior <- gather(posterior, key = "HS", value = "predict")
    • ggplot(plot_posterior, aes(x = predict)) + facet_wrap(~ HS, ncol = 1) + geom_density()

Conclusion:

  • Implementing Bayesian models as one solution for drawing inferences
    • Differences between frequentist and Bayesian
    • Importance of making correct inferences
  • Modifying a Bayesian model
  • Evaluating fit of a Bayesian model
  • Using the model to make predictions and communicate results
  • Additional topics remain for further exploration

Example code includes:

# Save the model parameters
tidy_coef <- tidy(stan_model)

# Extract intercept and slope
model_intercept <- tidy_coef$estimate[1]
model_slope <- tidy_coef$estimate[2]

# Create the plot
ggplot(songs, aes(x = song_age, y = popularity)) +
  geom_point() +
  geom_abline(intercept = model_intercept, slope = model_slope)


# Save the values from each draw of the posterior distribution
draws <- spread_draws(stan_model, `(Intercept)`, `song_age`)

# Print the `draws` data frame to the console
draws

# Create the plot
ggplot(songs, aes(x = song_age, y = popularity)) +
  geom_point()

# Create the plot
ggplot(songs, aes(x = song_age, y = popularity)) +
    geom_point() +
    geom_abline(data = draws, aes(intercept = `(Intercept)`, slope = song_age), 
                size = 0.1, alpha = 0.2, color = "skyblue"
                )

# Create the plot
ggplot(songs, aes(x = song_age, y = popularity)) +
    geom_point() +
    geom_abline(data = draws, aes(intercept = `(Intercept)`, slope = song_age), 
                size = 0.1, alpha = 0.2, color = "skyblue"
                ) +
    geom_abline(intercept = model_intercept, slope = model_slope)

# Estimate the regression model
stan_model <- stan_glm(popularity ~ song_age + artist_name, data = songs)

# Print the model summary
summary(stan_model)

# Get posteriors of predicted scores for each observation
posteriors <- posterior_predict(stan_model)

# Print 10 predicted scores for 5 songs
posteriors[1:10, 1:5]


# Create data frame of new data
predict_data <- data.frame(song_age = 663, artist_name = "Beyoncé")

# Create posterior predictions for Lemonade album
new_predictions <- posterior_predict(stan_model, newdata = predict_data)

# Print first 10 predictions for the new data
new_predictions[1:10, ]

# Print a summary of the posterior distribution of predicted popularity
summary(new_predictions[, 1])


# View new data predictions
new_predictions[1:10, ]

# Convert to data frame and rename variables
new_predictions <- as.data.frame(new_predictions)
colnames(new_predictions) <- c("Adele", "Taylor Swift", "Beyoncé")

# Create tidy data structure
plot_posterior <- gather(new_predictions, key = "artist_name", value = "predict")

# Print formatted data
head(plot_posterior)


# Create plot of the posterior predictions by artist
ggplot(plot_posterior, aes(x = predict)) +
    facet_wrap(~ artist_name, ncol = 1) +
    geom_density()

ChIP-seq Workflows in R

Chapter 1 - Introduction to ChIP-seq

What is ChIP-seq?

  • The core of ChIP-seq is understanding how the cells in the body know what to do
  • The function of a cell is largely determined by the expressed genes (DNA - RNA - Proteins)
    • Inhibitors will stop the process
    • Regulators help keep the cell producing the right proteins - disorders lead to diseases like cancer
  • Can use ChIP-seq to find over-represented genes in various patient cohorts (such as cancer patients)
  • Dataset for the course will be prostate cancer - 5 primary tumors, 3 treatment-resistant
    • library(GenomicAlignments)
    • reads <- readGAlignments('file_name')
    • seqnames(reads) # obtaining read coordinates
    • start(reads) # obtaining read coordinates
    • end(reads) # obtaining read coordinates
    • coverage(reads) # computing coverage
  • Can also access the peak calls in the ChIP-seq experiment
    • library(rtracklayer)
    • peaks <- import.bed('file_name')
    • chrom(peaks)
    • ranges(peaks)
    • score(peaks)

ChIP-seq Workflow:

  • First step of the workflow is read mapping - mapping reads to genomes
  • Then, create a coverage profile (number of overlapping reads)
  • Then, import the mapped reads and verify the quality (these are usually the first steps in R - the above steps are usually in more specialized software)
  • Then, identify peaks by comparing samples - AR (androgen receptor) binding sites that are over-represented
  • Then, interpret the findings for a better understanding of the associated biological processes
    • Heat maps - differences and similarities across samples
    • Creating UpSet plots, such as upset(fromList(peak_sets))

ChIP-seq Results Summary:

  • The heatmap plot is useful for assessing sample quality - check for expected patterns
  • Heights of individual peaks across samples can be more informative - height of peak by group can be particularly insightful

Example code includes:

# Print a summary of the 'reads' object
print(reads)

# Get the start position of the first read
start_first <- start(reads)[1]

# Get the end position of the last read
end_last <- end(reads)[length(reads)]

# Compute the number of reads covering each position in the selected region
cvg <- coverage(reads)


# Print a summary of the 'peaks' object
print(peaks)

# Use the score function to find the index of the highest scoring peak
max_idx <- which.max(score(peaks))

# Extract the genomic coordinates of the highest scoring peak using the `chrom` and `ranges` functions
max_peak_chrom <- chrom(peaks)[max_idx]
max_peak_range <- ranges(peaks)[max_idx]


# Create a vector of colors to label groups (there are 2 samples per group)
group <- c(primary = rep("blue", 2), TURP = rep("red", 2))

# Plot the sample correlation matrix `sample_cor` as a heat map
heatmap(sample_cor, ColSideColors = group, RowSideColors = group, 
        cexCol = 0.75, cexRow = 0.75, symm = TRUE)

# Create a heat map of peak read counts
heatmap(read_counts, ColSideColors = group, labRow = "", cexCol = 0.75)


# Take a look at the full gene sets
print(ar_sets)

# Visualise the overlap between the two groups using the `upset` function
upset(fromList(ar_sets))

# Print the genes with differential binding
print(db_sets)

# Visualise the overlap between the two groups using the `upset` function
upset(fromList(db_sets))

Chapter 2 - Back to Basics - Preparing ChIP-seq Data

Importing Data:

  • Read mapping and peak calling are typically carried out with specialized (non-R) tools and then stored in BAM (Binary Alignment Map) format
  • Can use Rsamtools package to interact with BAM files
    • Rsamtools provides functions for indexing, reading, filtering and writing of BAM files
  • Use readGAlignments to import mapped reads
    • library(GenomicAlignments)
    • reads <- readGAlignments(bam_file)
  • Use BamViews to define regions of interest
    • library(GenomicRanges)
    • library(Rsamtools)
    • ranges <- GRanges(…)
    • views <- BamViews(bam_file, bamRanges=ranges)
  • Use import.bed to load peak calls from a BED file
    • library(rtracklayer)
    • peaks <- import.bed(peak_bed, genome="hg19")
    • bams <- BamViews(bam_file, bamRanges=peaks)
    • reads <- readGAlignments(bams)

Closer Look at Peaks:

  • Can use Gviz to combine data from multiple sources into a single plot
    • Where are the reads located, and the associated coverage
    • Annotations of key features of read coverage
  • Need to load the key library and set context for data to be plotted
    • library(Gviz)
    • ideogram <- IdeogramTrack("chr12", "hg19")
    • axis <- GenomeAxisTrack()
    • plotTracks(list(ideogram, axis), from=101360000, to=101380000)
  • Can then add data to the track
    • cover_track <- DataTrack(cover_ranges, window=100000, type='h', name="Coverage")
    • plotTracks(list(ideogram, cover_track, axis), from=101360000, to=101380000)
    • peak_track <- AnnotationTrack(peaks, name="Peaks")
    • plotTracks(list(ideogram, cover_track, peak_track, axis), from=101360000, to=101380000)
    • library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    • tx <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene, chromosome="chr12", start=101360000, end=101380000, name="Genes")
    • plotTracks(list(ideogram, cover_track, peak_track, tx, axis), from=101360000, to=101380000)

Cleaning ChIP-seq Data:

  • Incorrectly mapped reads can produce false peaks - “genomic repeats”
    • Particularly problematic if the reference and the sample differ in the number of false reads
    • Low complexity regions (such as end of chromosomes) tend to have poorer quality
  • Amplification bias can be a concern - introduced during PCR just prior to sequencing
  • Quality Control reports can help diagnose the potential issues
    • library(ChIPQC)
    • qc_report <- ChIPQC(experiment="sample_info.csv", annotation="hg19")
    • ChIPQCreport(qc_report)
    • Input is a CSV file mapping samples to input files and descriptions
  • Cleaning the data has many steps
    • Remove duplicate reads
    • Remove reads with multiple hits or low mapping quality
    • Remove peaks in “blacklisted” regions – available from the ENCODE project

Assessing Enrichment:

  • Sequencing captures the ends of the immunoprecipitated fragments, so each fragment is represented by reads at its two ends rather than across its full length
  • Can be helpful to compute read coverage on the forward and reverse strands and then aggregate the findings
    • reads <- readGAlignments(bam)
    • reads_gr <- granges(reads[[1]])
    • frag_length <- fragmentlength(qc_report)["GSM1598218"]
    • reads_ext <- resize(reads_gr, width=frag_length)
    • cover_ext <- coverage(reads_ext)
  • Question is how does coverage look in peaks relative to the rest of the genome
    • bins <- tileGenome(seqinfo(reads), tilewidth=200, cut.last.tile.in.chrom=TRUE) # create 200 bp bins along the genome
    • peak_bins_overlap <- findOverlaps(bins, peaks) # can find the overlaps
    • peak_bins <- bins[from(peak_bins_overlap), ] # subset to just the overlaps
    • peak_bins$score <- countOverlaps(peak_bins, reads) # count number of overlapping reads
  • Can create a function for the binning reads process
    • count_bins <- function(reads, target, bins){
    •   # Find all bins overlapping peaks
    •   overlap <- from(findOverlaps(bins, target))
    •   target_bins <- bins[overlap, ]
    •   # Count the number of reads overlapping each peak bin
    •   target_bins$score <- countOverlaps(target_bins, reads)
    •   target_bins
    • }
  • Can find coverage for the blacklisted regions in much the same way
    • peak_bins <- count_bins(reads_ext, peaks, bins)
    • bl_bins <- count_bins(reads_ext, blacklist.hg19, bins)
  • Can then get the background coverage by considering all of the remaining bins
    • bkg_bins <- subset(bins, !bins %in% peak_bins & !bins %in% bl_bins)
    • bkg_bins$score <- countOverlaps(bkg_bins, reads_ext)
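The overlap-counting idea behind these steps can be illustrated on toy data (IRanges from Bioconductor; the bins and reads below are invented, not the course data):

```r
library(IRanges)

# Five 200 bp bins tiling positions 1-1000
bins <- IRanges(start = seq(1, 801, by = 200), width = 200)

# A handful of 36 bp "reads"
reads <- IRanges(start = c(10, 50, 450, 460, 470), width = 36)

# Count how many reads overlap each bin
countOverlaps(bins, reads)
```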

Example code includes:

# Load reads from the chr20_bam file
reads <- readGAlignments(chr20_bam)

# Create a `BamViews` object for the range 29805000 - 29820000 on chromosome 20
bam_views <- BamViews(chr20_bam, bamRanges=GRanges("chr20", IRanges(start=29805000, end=29820000)))

# Load only the reads in that view
reads_sub <- readGAlignments(bam_views)

# Print the `reads_sub` object
str(reads_sub)


# Load peak calls from chr20_peaks
peaks <- import.bed(chr20_peaks, genome="hg19")

# Create a BamViews object
bam_views <- BamViews(chr20_bam, bamRanges=peaks)

# Load the reads
reads <- readGAlignments(bam_views)


# Create tracks
peak_track <- AnnotationTrack(peak_calls, name="Peaks")
cover_track <- DataTrack(cover_ranges, window=10500, type="polygon", name="Coverage", fill.mountain=c("lightgrey", "lightgrey"), col.mountain="grey")

# Highlight peak locations across tracks
peak_highlight <- HighlightTrack(trackList = list(cover_track, peak_track), range = peak_calls)

# Produce plot
plotTracks(list(ideogram, peak_highlight, GenomeAxisTrack()), chromosome="chr20", from=start_pos, to=end_pos)


# Load reads with mapping qualities by requesting the "mapq" entries
reads <- readGAlignments(bam_file, param=ScanBamParam(what="mapq"))

# Identify good quality alignments
high_mapq <- mcols(reads)$mapq >= 20

# Examine mapping quality distribution for high and low quality alignments
boxplot(mcols(reads)$mapq ~ high_mapq, xlab="good quality alignments", ylab="mapping quality")

# Remove low quality alignments
reads_good <- subset(reads, high_mapq)

Chapter 3 - Comparing ChIP-seq Samples

Introduction to Differential Binding:

  • Objective is to find differences in prostate cancer cells that respond to or are resistant to treatment
  • Can use PCA to find differences in the groups - directions of the components with the most variation
    • Points can be projected on to the plane
    • qc_result <- ChIPQC("sample_info.csv", "hg19")
    • counts <- dba.count(qc_result, summits=250)
    • plotPrincomp(counts)
  • Can use hierarchical clustering to cluster similar samples for easier future visualization
    • distance <- dist(t(coverage))
    • dendro <- hclust(distance)
    • plot(dendro)
    • dba.plotHeatmap(peaks, maxSites = peak_count, correlations = FALSE)

Testing for Differential Binding:

  • Can use the DiffBind package for either DESeq2 or edgeR
  • Can count the records in a peak dataset
    • peak_counts <- dba.count(qc_output, summits=250)
  • Can add a contrast for how the sample should be split
    • peak_counts <- dba.contrast(peak_counts, categories = DBA_CONDITION)
  • Want to assess the coverage differences in comparison to a control sequence
    • bind_diff <- dba.analyze(peak_counts)
    • dba.plotPCA(bind_diff, DBA_CONDITION, contrast=1)
    • dba.plotHeatmap(bind_diff, DBA_CONDITION, contrast=1)

Closer Look at Differential Binding:

  • Plotting is a good way to look at the results - can use the DiffBind library
  • Can create MA plots - x-axis shows mean concentration, y-axis shows the log-fold-change between resistant and responsive samples
    • dba.plotMA(dba_object) # MA plots
    • May need to normalize data in the MA plot to eliminate systematic downward bias
  • Can create volcano plots - FDR as a function of log-fold-change
    • dba.plotVolcano(dba_object)
  • Can create box plots
    • dba.plotBox(dba_object)

Example code includes:

# Compute the pairwise distances between samples using `dist`
cover_dist <- dist(t(cover))

# Use `hclust()` to create a dendrogram from the distance matrix
cover_dendro <- hclust(cover_dist)

# Plot the dendrogram
plot(cover_dendro)


# Print the `peaks` object
print(peaks)

# Obtain the coordinates of the merged peaks
merged_peaks <- peaks$merged

# Extract the number of peaks present in the data
peak_count <- nrow(merged_peaks)

# Create a heatmap using the `dba.plotHeatmap()` function
dba.plotHeatmap(peaks, maxSites = peak_count, correlations = FALSE)


# Examine the ar_binding object
print(ar_binding)

# Identify the category corresponding to the tumor type contrast
contrast <- DBA_CONDITION

# Establish the contrast to compare the two tumor types
ar_binding <- dba.contrast(ar_binding, categories=contrast, minMembers=2)

# Examine the ar_binding object again to confirm that the contrast has been added
print(ar_binding)


# Examine the `ar_binding` object to confirm that it contains the required contrast
print(ar_binding)

# Run the differential binding analysis
ar_diff <- dba.analyze(ar_binding)

# Examine the result
print(ar_diff)


# Create a PCA plot using all peaks
dba.plotPCA(ar_diff, DBA_CONDITION)

# Create a PCA plot using only differentially bound peaks
dba.plotPCA(ar_diff, DBA_CONDITION, contrast = 1)

# Create a heatmap using all peaks
dba.plotHeatmap(ar_diff, DBA_CONDITION, correlations = FALSE, maxSites = 440)

# Create a heatmap using only differentially bound peaks
dba.plotHeatmap(ar_diff, DBA_CONDITION, contrast=1, correlations = FALSE)


# Create an MA plot
dba.plotMA(ar_diff)


# Create a volcano plot
dba.plotVolcano(ar_diff)


# Create a box plot of the peak intensities
compare_groups <- dba.plotBox(ar_diff, notch=FALSE)

# Inspect the returned p-values
print(compare_groups)

Chapter 4 - From Peaks to Genes to Function

Interpreting ChIP-seq Peaks:

  • Want to find the genes affected by a particular binding site - no way to know for sure, but often look for “closest” genes
    • Obtain information about gene locations
    • Assign peaks to genes
    • Identify genes associated with stronger peaks in one of the conditions
  • Transcript annotations are helpful
    • library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    • genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
    • library(org.Hs.eg.db) # may be easier for reading/interpreting
    • select(org.Hs.eg.db, keys=gene_id, columns="SYMBOL", keytype="ENTREZID")
  • Gene symbols are tricky - can have multiple genes given the same symbol
  • Can annotate peaks using the transcript data
    • library(ChIPpeakAnno)
    • annoPeaks(peaks, human_genes, bindingType="startSite", bindingRegion=c(-5000,5000))
    • library(DiffBind)
    • dba.plotVenn(peaks, mask=1:2)
  • Can use UpSet plots for better interpretation in larger datasets
    • library(UpSetR)
    • called_peaks <- as.data.frame(peaks$called)
    • upset(called_peaks, sets=colnames(peaks$called), order.by='freq')

Interpreting Gene Lists:

  • Can use gene set enrichment for very long lists of genes - finds gene sets associated with the genes of interest
    • library(chipenrich)
    • chipenrich(peaks, genome='hg19', genesets = 'hallmark', locusdef = 'nearest_tss')

Advanced ChIP-seq Analyses:

  • Loading and analyzing ChIP-seq data in R
  • First step is to import data and then visualize read coverage, peaks, etc.
  • Need to run quality control procedures
  • Can investigate differential binding with DiffBind
  • Can continue to explore additional datasets

Example code includes:

# Extract peaks from ChIPQCexperiment object
peak_calls <- peaks(ar_calls)

# Only keep samples that passed QC
peak_passed <- peak_calls[qc_pass]

# Find overlaps between peak sets
peaks_combined <- findOverlapsOfPeaks(peak_passed[[1]], peak_passed[[2]], 
                                      peak_passed[[3]], peak_passed[[4]], 
                                      maxgap=50
                                      )

# Examine merged peak set
print(peaks_combined)


# Annotate peaks with closest gene
peak_anno <- annoPeaks(peaks_merged, human_genes, bindingType="startSite", bindingRegion=c(-5000,5000))

# How many peaks were found close to genes?
length(peak_anno)

# Where are peaks located relative to genes?
table(peak_anno$insideFeature)


# Create Venn diagram
dba.plotVenn(ar_diff, mask=1:4)

# Convert the matrix of called peaks into a data frame
called_peaks <- as.data.frame(ar_diff$called)

# Create UpSet plot
upset(called_peaks, keep.order = TRUE, sets=colnames(ar_diff$called), order.by="freq")


# Select all peaks with higher intensity in treatment resistant samples
turp_peaks <- peaks_binding[, "GSM1598218"] + peaks_binding[, "GSM1598219"] < peaks_binding[, "GSM1598223"] + peaks_binding[, "GSM1598225"]

# Run enrichment analysis
enrich_turp <- chipenrich(peaks_comb[turp_peaks, ], genome="hg19", 
                   genesets = "hallmark", out_name = NULL, 
                   locusdef = "nearest_tss", qc_plots=FALSE)

# Print the results of the analysis
print(enrich_turp$results)


# Examine the top gene sets
head(enrich_primary$results)

# Extract the gene IDs for the top ranking set
genes <- enrich_primary$results$Geneset.Peak.Genes[1]

# Split gene IDs into a vector
genes_split <- strsplit(genes, ', ')[[1]]

# Convert gene IDs to gene symbols
gene_symbol <- select(org.Hs.eg.db, keys=genes_split, columns="SYMBOL", keytype="ENTREZID")

# Print the result
print(gene_symbol)


# This is the base URL for all KEGG pathways
base_url <- "https://www.kegg.jp/pathway/"

# Add pathway ID to URL
path_url <- paste0(base_url, top_path, collapse="+")

# Collapse gene IDs into selection string
gene_select <- paste(genes, collapse="+")

# Add gene IDs to URL
path_url <- paste(path_url, gene_select, sep="+")

Designing and Analyzing Clinical Trials in R

Chapter 1 - Principles

Fundamentals:

  • Clinical trials are scientific experiments used to evaluate the safety and efficacy of one or more treatments in humans
    • Pharma, medical devices, medical procedures, etc.
  • Four general phases of clinical trials
    • Phase I - small group of healthy volunteers to look for effects and side effects
    • Phase II - small groups of patients with the disease - optimal doses for safety and efficacy
    • Phase III - chosen dose evaluated for efficacy and safety against a control
    • Phase IV - post-marketing surveillance
  • Randomized controlled trials are considered the gold standard for treatment evaluation - reduces the impact of confounding variables (such as in cohort studies)
    • Aim is to have similar patient characteristics in each of the groups
    • Blinding is when the patient does not know what they are receiving
    • Double-blinding is when the researcher and the patient BOTH do not know what they are receiving

Types of Data and Endpoints:

  • Clinical trials are highly regulated and must meet various international standards
  • Clinical measures are pre-defined in the protocol (endpoints) and can be a mix of primary or secondary measures related to both efficacy and safety
  • Endpoints can be continuous or categorical - example of continuous
    • ggplot(data=exercise, aes(x=sbp_change)) + geom_histogram(fill="white", color="black") + xlab("SBP Change, mmHg")
    • exercise %>% summarise(mean_sbp = mean(sbp_baseline), sd_sbp = sd(sbp_baseline))
  • Endpoints can also be categorical, such as did the patient recover in 30 days
    • finaldata %>% group_by(treatment) %>% filter(!is.na(response)) %>% summarise (n = n()) %>% mutate(prop = n / sum(n))
    • table(finaldata$response, finaldata$treatment)
  • While binary and continuous endpoints are both common, there are other types also
    • Discrete values, such as drinks per week or years to progression

Basic Statistical Analysis:

  • There is typically a target population (all patients with a disease), a sample population drawn from the target population, and inferences drawn about the target population from the sample population
  • Hypothesis testing is typically run against a null hypothesis - alternative hypothesis can be one-sided or two-sided
    • The p-value is the probability of observing something at least as extreme as our data if the null hypothesis is true - typical hurdles are set at 0.05
  • For comparing distributions, can use the Wilcoxon test or the Chi-squared test
    • wilcox.test(outcome.variable ~ group.variable, data=dataset)
    • table1 <- table(care.trial$group, care.trial$recover)
    • prop.test(table1, correct=FALSE)

Example code includes:

Acupuncture <- readRDS("./RInputFiles/Ex1_1_1.Rds")

#Explore the Acupuncture dataset with the str() function 
str(Acupuncture)
## 'data.frame':    396 obs. of  18 variables:
##  $ id                           : num  100 101 104 105 108 112 113 114 126 130 ...
##  $ age                          : num  47 52 32 53 56 45 45 49 47 46 ...
##  $ sex                          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ migraine                     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ chronicity                   : num  35 8 14 10 40 27 30 49 42 3 ...
##  $ treatment.group              : Factor w/ 2 levels "Acupuncture",..: 1 2 2 2 2 1 1 1 2 1 ...
##  $ score.baseline               : num  10.8 9.5 16 32.5 16.5 ...
##  $ score.baseline.4             : Factor w/ 4 levels "[6.75,15.2]",..: 1 1 2 3 2 1 4 3 2 3 ...
##  $ age.group                    : Factor w/ 4 levels "18-34","35-44",..: 3 3 1 3 4 3 3 3 3 3 ...
##  $ score.month3                 : num  NA NA NA 44 17.5 ...
##  $ score.month12                : num  NA NA 15.3 NA 23.2 ...
##  $ withdrawal.reason            : Factor w/ 7 levels "adverse effects",..: 5 7 NA 7 NA NA NA NA NA NA ...
##  $ completedacupuncturetreatment: num  NA NA NA NA NA 1 NA NA NA NA ...
##  $ completer                    : num  0 0 1 0 1 1 1 1 1 1 ...
##  $ total.therap.visits          : num  NA 2 0 2 0 0 7 0 0 10 ...
##  $ total.gp.visits              : num  NA 4 0 0 0 5 1 0 1 0 ...
##  $ total.spec.visits            : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ total.days.sick              : num  NA 6 3 NA 23 2 6 9 19 0 ...
#Display the treatment group frequencies
table(Acupuncture$treatment.group)
## 
## Acupuncture     Control 
##         202         194
#Generate summaries of the variables by treatment group and save results as baselines
baselines <- compareGroups::compareGroups(treatment.group ~ score.baseline + age + sex, data = Acupuncture)

#Use the createTable function to display the results saved in baselines
baseline.table <- compareGroups::createTable(baselines, show.ratio = FALSE, show.p.overall=FALSE)

#Display the created summary table
baseline.table
## 
## --------Summary descriptives table by 'treatment.group'---------
## 
## ______________________________________ 
##                Acupuncture   Control   
##                   N=202       N=194    
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## score.baseline 25.8 (15.5) 27.6 (16.8) 
## age            45.6 (10.6) 45.3 (11.5) 
## sex:                                   
##     Female     169 (83.7%) 164 (84.5%) 
##     Male       33 (16.3%)  30 (15.5%)  
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#Generate a variable for the change from baseline at 12 months
Acupuncture$diff.month12 <- Acupuncture$score.month12 - Acupuncture$score.baseline

#Use the new variable to generate the percentage change from baseline at 12 months
Acupuncture$pct.month12 <- Acupuncture$diff.month12 / Acupuncture$score.baseline * 100

#Generate a histogram for percentage change from baseline within each treatment group
ggplot(data=Acupuncture, aes(x=pct.month12)) + 
  geom_histogram(fill="white", color="black") + facet_wrap( ~ treatment.group) +
  xlab("Percentage Change from Baseline at Month 12")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 100 rows containing non-finite values (stat_bin).

#Generate the binary response variable. 
Acupuncture$resp35.month12 <- ifelse(Acupuncture$pct.month12 < (-35), 1, 0)

#Encode this new variable as a factor.
Acupuncture$resp35.month12 <- factor(Acupuncture$resp35.month12, 
                                     levels = c(1,0), 
                                     labels=c("greater than 35%", "less than or eq to 35%")
                                     )

#Tabulate the numbers and percentages of patients in each category. 
Acupuncture %>% 
  group_by(resp35.month12) %>% 
  filter(!is.na(resp35.month12)) %>%
  summarise(n = n()) %>% 
  mutate(pct = n / sum(n)*100)
## Warning: Factor `resp35.month12` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 2 x 3
##   resp35.month12             n   pct
##   <fct>                  <int> <dbl>
## 1 greater than 35%         132  44.6
## 2 less than or eq to 35%   164  55.4
#Dichotomize the variable for complementary therapist visits into 0 or at least 1 visit.
Acupuncture$any.therap.visits <- ifelse(Acupuncture$total.therap.visits == 0, 0, 1)

#Encode the new variable as a factor
Acupuncture$any.therap.visits <- factor(Acupuncture$any.therap.visits, 
                                        levels = c(0,1), 
                                        labels=c("Did not visit CT", "Visited CT")
                                        )

#Dichotomize the variable for GP visits into 0 or at least 1 visit.
Acupuncture$any.gp.visits <- ifelse(Acupuncture$total.gp.visits == 0, 0, 1)

#Encode the new variable as a factor
Acupuncture$any.gp.visits <- factor(Acupuncture$any.gp.visits, 
                                        levels = c(0,1), 
                                        labels=c("Did not visit GP", "Visited GP")
                                        )

#Dichotomize the variable for specialist visits into 0 or at least 1 visit.
Acupuncture$any.spec.visits <- ifelse(Acupuncture$total.spec.visits == 0, 0, 1)

#Encode the new variable as a factor
Acupuncture$any.spec.visits <- factor(Acupuncture$any.spec.visits, 
                                        levels = c(0,1), 
                                        labels=c("Did not visit specialist", "Visited specialist")
                                        )

#Generate a combined binary endpoint for having any professional visits. 
Acupuncture$combined <- ifelse(Acupuncture$any.therap.visits=="Did not visit CT" &
                                   Acupuncture$any.gp.visits=="Did not visit GP" & 
                                   Acupuncture$any.spec.visits=="Did not visit specialist", 0, 1
                               )

#Encode the new variable as a factor
Acupuncture$combined <- factor(Acupuncture$combined, 
                               levels = c(0,1), 
                               labels=c("No visits", "At least one visit")
                               )

#Tabulate the new composite endpoint.
table(Acupuncture$combined, useNA="ifany")
## 
##          No visits At least one visit               <NA> 
##                118                211                 67
#Perform the t-test, assuming the variances are equal in the treatment groups
t.test(pct.month12 ~ treatment.group, var.equal=TRUE, data = Acupuncture)
## 
##  Two Sample t-test
## 
## data:  pct.month12 by treatment.group
## t = -3.3531, df = 294, p-value = 0.0009039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -25.176008  -6.552921
## sample estimates:
## mean in group Acupuncture     mean in group Control 
##                 -32.36822                 -16.50376
#Use the compareGroups function to save a summary of the results in pct.month12.test
pct.month12.test <- compareGroups::compareGroups(treatment.group ~ pct.month12, data = Acupuncture)

#Use the createTable function to summarize and store the results saved in pct.month12.test.
pct.month12.table <- compareGroups::createTable(pct.month12.test, show.ratio = FALSE, show.p.overall=TRUE)

#Display the results of pct.month12.table
pct.month12.table
## 
## --------Summary descriptives table by 'treatment.group'---------
## 
## _________________________________________________ 
##              Acupuncture     Control    p.overall 
##                 N=158         N=138               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## pct.month12 -32.37 (42.3) -16.50 (38.6)   0.001   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#Use a boxplot to visualize the total days off sick by treatment group.  
ggplot(data=Acupuncture, aes(x=treatment.group, y=total.days.sick)) + 
  geom_boxplot(fill="white", color="black") +
  ylab("Total days off sick") +  xlab("Treatment group")
## Warning: Removed 68 rows containing non-finite values (stat_boxplot).

#Use the Wilcoxon Rank Sum test to compare the two distributions.
wilcox.test(total.days.sick ~ treatment.group, data=Acupuncture)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  total.days.sick by treatment.group
## W = 11991, p-value = 0.09957
## alternative hypothesis: true location shift is not equal to 0
#Perform the test of proportions on resp35.month12 by treatment.group.
prop.test(table(Acupuncture$treatment.group, Acupuncture$resp35.month12), correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  table(Acupuncture$treatment.group, Acupuncture$resp35.month12)
## X-squared = 15.032, df = 1, p-value = 0.0001057
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1143954 0.3346965
## sample estimates:
##    prop 1    prop 2 
## 0.5506329 0.3260870
#Use the tidy function to store and display a summary of the test results.
resp35.month12.test <- broom::tidy(prop.test(table(Acupuncture$treatment.group, 
                                                   Acupuncture$resp35.month12
                                                   ), correct=FALSE
                                             )
                                   )
resp35.month12.test
## # A tibble: 1 x 9
##   estimate1 estimate2 statistic p.value parameter conf.low conf.high method
##       <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
## 1     0.551     0.326      15.0 1.06e-4         1    0.114     0.335 2-sam~
## # ... with 1 more variable: alternative <chr>
#Calculate the treatment difference
resp35.month12.test$estimate1 - resp35.month12.test$estimate2
## [1] 0.224546

Chapter 2 - Trial Designs

Randomization Methods:

  • Good randomization helps ensure the treatment groups are comparable and appropriately stratified
  • Simple randomization gives every patient a 50/50 chance of assignment to either group, regardless of how many are already in each group
    • treatment <- c("A", "B")
    • simple.list <- sample(treatment, 20, replace=TRUE)
    • cat(simple.list, sep="")
  • May want to instead use blocks, where each block has equal numbers
    • library(blockrand)
    • block.list <- blockrand(n=20, num.levels = 2, block.sizes = c(2,2))
    • Can further randomize the block sizes to avoid the issue of predictability at the end of the block
    • block.list2 <- blockrand(n=20, num.levels = 2, block.sizes = c(1,2))
  • May want to instead use stratified randomization
    • Age group, geographical region, disease severity, etc.
    • over50.severe.list <- blockrand(n=100, num.levels = 2, block.sizes = c(1,2,3,4), stratum='Over 50, Severe', id.prefix='O50_S', block.prefix='O50_S')

Crossover, Factorial, Cluster Randomized Trials:

  • Sometimes a desire to have each patient be their own control after a "washout" period (long enough to reverse effects) - called a crossover trial
    • May increase precision, eliminate inter-patient variability, and allow for smaller sample sizes
    • However, the order of treatments may impact outcomes, washout periods may be too short, and patients may drop out before finishing
  • Factorial designs are such that patients may receive A, B, A and B, or placebo only
  • Sometimes desirable to calculate the odds ratios, especially if the treatments are independent
    • Odds of Recovery = nRecover / nNotRecover
    • epitools::oddsratio.wald(recovery.trial$A, recovery.trial$recover)
  • May also want to run cluster-level trials, such as by school or by hospital
    • Can run into problems with sample sizes (need to be large, may not have enough, etc.)
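
The odds arithmetic in the bullets above can be checked by hand; a minimal sketch using made-up recovery counts (60/40 recovered vs. not on treatment A, 45/55 on placebo - all values hypothetical):

```r
# Odds ratio by hand from hypothetical 2x2 recovery counts
odds.A       <- 60 / 40   # odds of recovery on treatment A
odds.placebo <- 45 / 55   # odds of recovery on placebo
odds.ratio   <- odds.A / odds.placebo
round(odds.ratio, 3)
## [1] 1.833
```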

Equivalence and Non-Inferiority Trials:

  • Objective of an equivalence trial is to show similar efficacy - for example, where a generic is being released
    • Need to pre-specify a maximum acceptable difference (delta) between the groups
    • Equivalence is when the confidence interval lies entirely inside (-delta, +delta)
    • Non-inferiority is when one end of the confidence interval falls outside delta, even though the point estimate is inside
    • prop.test(table(infection.trial$Treatment, infection.trial$Infection), alternative = "less", conf.level = 0.95, correct=FALSE)
    • prop.test(table(infection.trial$Treatment, infection.trial$Infection), alternative = "greater", conf.level = 0.95, correct=FALSE)
    • prop.test(table(infection.trial$Treatment, infection.trial$Infection), alternative = "two.sided", conf.level = 0.90, correct=FALSE)
  • Sometimes the only objective is to show non-inferiority with a one-sided test (Ha: new treatment worse than existing), often at the 2.5% level
    • prop.test(table(infection.trial$Treatment, infection.trial$Infection), alternative = "less", conf.level = 0.975, correct=FALSE)
  • Need to state in advance the delta, the number of sides, and the significance levels
    • Lack of superiority does NOT imply equivalence
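
The boundary logic above can be sketched with simulated counts. All numbers here are hypothetical, including the 0.10 margin; a non-inferiority claim requires the upper confidence bound for the difference in infection rates to stay below delta:

```r
# Hypothetical counts: 20/200 infections on the new treatment, 25/200 on standard
counts <- matrix(c(20, 180,
                   25, 175),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("New", "Standard"), c("Infection", "No infection")))
delta <- 0.10  # pre-specified non-inferiority margin (illustrative)

# One-sided test at the 2.5% level: upper bound for p_new - p_standard
res <- prop.test(counts, alternative = "less", conf.level = 0.975, correct = FALSE)
res$conf.int[2] < delta   # TRUE here, so non-inferiority would be claimed
```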

Bioequivalence trials:

  • Bioequivalence is determined by blood draws after someone has taken a drug - pharmacokinetics (PK)
    • Absorption, Excretion
    • Assumption is often made that similar PK profiles will lead to similar safety and efficacy, saving time on Phase III trials
  • PK profiles are often assessed based on key statistics
    • Cmax - highest concentration
    • Tmax - time to highest concentration
    • T1/2 - half-life
    • AUC - area under the curve
    • Crossover designs are frequently used, with washouts being many times greater than the half-life
  • The AUC is often calculated using the trapezoidal method - objective is to be between (0.8, 1.25) of the reference drug for the 90% CI
    • library(PKNCA)
    • pk.calc.auc(PKData$plasma.conc.n, PKData$rel.time, interval=c(0.25, 12), method="linear")
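
The trapezoidal calculation itself is simple enough to verify by hand; this sketch uses a made-up concentration profile (not the course's PKData):

```r
# Hypothetical concentration-time profile (hours, ng/mL)
rel.time    <- c(0.25, 0.5, 1, 2, 4, 8, 12)
plasma.conc <- c(2.2, 7.3, 13.0, 9.0, 5.6, 2.4, 0.6)

# Linear trapezoidal rule: sum the trapezoid areas between successive samples
trap.auc <- function(conc, time) {
  sum(diff(time) * (head(conc, -1) + tail(conc, -1)) / 2)
}

trap.auc(plasma.conc, rel.time)
## [1] 53.8625
```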

Example code includes:

#Generate a vector to store treatment labels "A" and "B"
set.seed(123)
arm<-c("A", "B")

#Randomly select treatment arm 14 times with the sample function and store in a vector
simple <- sample(arm, 14, replace=TRUE)

#Display the contents of the vector
simple
##  [1] "B" "B" "B" "B" "B" "A" "B" "A" "B" "A" "B" "A" "A" "A"
#Tabulate the numbers assigned to each treatment.
table(simple)
## simple
## A B 
## 6 8
#Use the blockrand function for 14 patients, two arms and block size 2.
set.seed(123)
block2 <- blockrand::blockrand(n=14, num.levels = 2,  block.prefix='B', block.sizes = c(1,1))

#Display the list.
block2
##    id block.id block.size treatment
## 1   1       B1          2         B
## 2   2       B1          2         A
## 3   3       B2          2         B
## 4   4       B2          2         A
## 5   5       B3          2         A
## 6   6       B3          2         B
## 7   7       B4          2         B
## 8   8       B4          2         A
## 9   9       B5          2         A
## 10 10       B5          2         B
## 11 11       B6          2         A
## 12 12       B6          2         B
## 13 13       B7          2         B
## 14 14       B7          2         A
#Tabulate the numbers per treatment arm.
table(block2$treatment)
## 
## A B 
## 7 7
#Use block randomization to produce lists of length 100 and random block sizes between 2 and 8.
set.seed(123)
under55 <- blockrand::blockrand(n=100, num.levels = 2, block.sizes = 1:4, 
                                id.prefix='U55', block.prefix='U55', stratum='<55y'
                                )
above55 <- blockrand::blockrand(n=100, num.levels = 2, block.sizes = 1:4, 
                                id.prefix='A55', block.prefix='A55',stratum='>=55y'
                                )

#Explore the two lists 
head(under55)
##       id stratum block.id block.size treatment
## 1 U55001    <55y    U5501          6         B
## 2 U55002    <55y    U5501          6         A
## 3 U55003    <55y    U5501          6         B
## 4 U55004    <55y    U5501          6         A
## 5 U55005    <55y    U5501          6         A
## 6 U55006    <55y    U5501          6         B
head(above55)
##       id stratum block.id block.size treatment
## 1 A55001   >=55y    A5501          6         A
## 2 A55002   >=55y    A5501          6         B
## 3 A55003   >=55y    A5501          6         A
## 4 A55004   >=55y    A5501          6         A
## 5 A55005   >=55y    A5501          6         B
## 6 A55006   >=55y    A5501          6         B
#Tabulate the numbers assigned to each treatment within each strata
table(under55$treatment)
## 
##  A  B 
## 53 53
table(above55$treatment)
## 
##  A  B 
## 51 51
fact.data <- readRDS("./RInputFiles/fact.data.Rds")
str(fact.data)
## 'data.frame':    502 obs. of  3 variables:
##  $ glutamine: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 2 1 1 ...
##  $ selenium : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 1 1 2 2 ...
##  $ infection: Factor w/ 2 levels "No","Yes": 2 1 2 2 1 2 2 1 1 1 ...
#Explore the fact.data using the head function.
head(fact.data)
##   glutamine selenium infection
## 1       Yes       No       Yes
## 2        No       No        No
## 3        No      Yes       Yes
## 4       Yes       No       Yes
## 5        No      Yes        No
## 6        No       No       Yes
#Display the numbers with and without infections by supplement combination.
fact.data %>% 
    count(glutamine, selenium, infection)
## # A tibble: 8 x 4
##   glutamine selenium infection     n
##   <fct>     <fct>    <fct>     <int>
## 1 No        No       No           57
## 2 No        No       Yes          68
## 3 No        Yes      No           64
## 4 No        Yes      Yes          63
## 5 Yes       No       No           55
## 6 Yes       No       Yes          71
## 7 Yes       Yes      No           61
## 8 Yes       Yes      Yes          63
#Display the numbers and proportions with infections for those who received glutamine.
fact.data %>% 
    group_by(glutamine) %>% 
    filter(infection=="Yes") %>%
    summarise (n = n()) %>% 
    mutate(prop = n / sum(n))
## # A tibble: 2 x 3
##   glutamine     n  prop
##   <fct>     <int> <dbl>
## 1 No          131 0.494
## 2 Yes         134 0.506
#Display the numbers and proportions with infections for those who received selenium.
fact.data %>% 
    group_by(selenium) %>% 
    filter(infection=="Yes") %>%
    summarise (n = n()) %>% 
    mutate(prop = n / sum(n))
## # A tibble: 2 x 3
##   selenium     n  prop
##   <fct>    <int> <dbl>
## 1 No         139 0.525
## 2 Yes        126 0.475
#Display the numbers with and without infections by supplement combination.
fact.data %>% 
    count(glutamine, selenium, infection)
## # A tibble: 8 x 4
##   glutamine selenium infection     n
##   <fct>     <fct>    <fct>     <int>
## 1 No        No       No           57
## 2 No        No       Yes          68
## 3 No        Yes      No           64
## 4 No        Yes      Yes          63
## 5 Yes       No       No           55
## 6 Yes       No       Yes          71
## 7 Yes       Yes      No           61
## 8 Yes       Yes      Yes          63
#Display the numbers and proportions with infections for those who received glutamine.
fact.data %>% 
    group_by(infection) %>% 
    filter(glutamine=="Yes") %>%
    summarise (n = n()) %>% 
    mutate(prop = n / sum(n))
## # A tibble: 2 x 3
##   infection     n  prop
##   <fct>     <int> <dbl>
## 1 No          116 0.464
## 2 Yes         134 0.536
#Display the numbers and proportions with infections for those who received selenium.
fact.data %>% 
    group_by(infection) %>% 
    filter(selenium=="Yes") %>%
    summarise (n = n()) %>% 
    mutate(prop = n / sum(n))
## # A tibble: 2 x 3
##   infection     n  prop
##   <fct>     <int> <dbl>
## 1 No          125 0.498
## 2 Yes         126 0.502
#Calculate the effect of glutamine on infection
epitools::oddsratio.wald(fact.data$glutamine, fact.data$infection)
## $data
##          Outcome
## Predictor  No Yes Total
##     No    121 131   252
##     Yes   116 134   250
##     Total 237 265   502
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##       No  1.000000        NA       NA
##       Yes 1.066991 0.7515148 1.514901
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##       No          NA           NA         NA
##       Yes  0.7180246    0.7216211  0.7169009
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
#Calculate the effect of selenium on infection
epitools::oddsratio.wald(fact.data$selenium, fact.data$infection)
## $data
##          Outcome
## Predictor  No Yes Total
##     No    112 139   251
##     Yes   125 126   251
##     Total 237 265   502
## 
## $measure
##          odds ratio with 95% C.I.
## Predictor  estimate     lower    upper
##       No  1.0000000        NA       NA
##       Yes 0.8122014 0.5718144 1.153646
## 
## $p.value
##          two-sided
## Predictor midp.exact fisher.exact chi.square
##       No          NA           NA         NA
##       Yes  0.2469929    0.2833307  0.2451355
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
relapse.trial <- data.frame(Treatment=rep(c("New", "Standard"), times=c(264, 263)), 
                            Relapse=rep(rep(c("At least one relapse", "No relapse"), times=2), 
                                        times=c(184, 80, 169, 94)
                                        ), 
                            stringsAsFactors = TRUE
                            )
str(relapse.trial)
## 'data.frame':    527 obs. of  2 variables:
##  $ Treatment: Factor w/ 2 levels "New","Standard": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Relapse  : Factor w/ 2 levels "At least one relapse",..: 1 1 1 1 1 1 1 1 1 1 ...
table(relapse.trial)
##           Relapse
## Treatment  At least one relapse No relapse
##   New                       184         80
##   Standard                  169         94
#Use the head function to explore the relapse.trial dataset
head(relapse.trial)
##   Treatment              Relapse
## 1       New At least one relapse
## 2       New At least one relapse
## 3       New At least one relapse
## 4       New At least one relapse
## 5       New At least one relapse
## 6       New At least one relapse
#Calculate the number of percentages of relapse by treatment group
relapse.trial %>% 
    group_by(Treatment, Relapse) %>% 
    summarise(n = n()) %>% 
    mutate(pct = (n / sum(n))*100)
## # A tibble: 4 x 4
## # Groups:   Treatment [2]
##   Treatment Relapse                  n   pct
##   <fct>     <fct>                <int> <dbl>
## 1 New       At least one relapse   184  69.7
## 2 New       No relapse              80  30.3
## 3 Standard  At least one relapse   169  64.3
## 4 Standard  No relapse              94  35.7
#Calculate the two-sided 90% confidence interval for the difference
prop.test(table(relapse.trial$Treatment, relapse.trial$Relapse), 
          alternative = "two.sided", conf.level=0.9, correct=FALSE
          )
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  table(relapse.trial$Treatment, relapse.trial$Relapse)
## X-squared = 1.7619, df = 1, p-value = 0.1844
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  -0.01289979  0.12166808
## sample estimates:
##    prop 1    prop 2 
## 0.6969697 0.6425856
PKData <- readRDS("./RInputFiles/PKData.Rds")
str(PKData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12 obs. of  6 variables:
##  $ subject.id : num  1001 1001 1001 1001 1001 ...
##  $ sample.id  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ time       : chr  "predose" "15min" "30min" "1h" ...
##  $ rel.time   : num  -0.25 0.25 0.5 1 1.5 2 3 4 6 8 ...
##  $ plasma.conc: chr  "<LLQ" "2.19" "7.28" "12.98" ...
##  $ unit       : chr  "ng/mL" "ng/mL" "ng/mL" "ng/mL" ...
#Display the dataset contents
head(PKData)
## # A tibble: 6 x 6
##   subject.id sample.id time    rel.time plasma.conc unit 
##        <dbl>     <dbl> <chr>      <dbl> <chr>       <chr>
## 1       1001         1 predose    -0.25 <LLQ        ng/mL
## 2       1001         2 15min       0.25 2.19        ng/mL
## 3       1001         3 30min       0.5  7.28        ng/mL
## 4       1001         4 1h          1    12.98       ng/mL
## 5       1001         5 1.5h        1.5  10.76       ng/mL
## 6       1001         6 2h          2    9.01        ng/mL
#Store a numeric version of the concentration variable in plasma.conc.n
PKData$plasma.conc.n <- as.numeric(PKData$plasma.conc)
## Warning: NAs introduced by coercion
#Use ggplot to plot the concentration levels against relative time
ggplot(data=PKData, aes(x=rel.time, y=plasma.conc.n)) + 
    geom_line() +
    geom_point() + ggtitle("Individual Concentration Profile") +
    xlab("Time Relative to First Dose, h") + 
    ylab("Plasma Concentration, ng/mL")
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).

#Use the summary function to find the max concentration level
summary(PKData$plasma.conc.n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.610   2.388   5.315   5.750   8.578  12.980       2
#Use pk.calc.tmax to find when Cmax occurred, specifying the concentration and time.
PKNCA::pk.calc.tmax(PKData$plasma.conc.n, PKData$rel.time)
## [1] 1
#Use pk.calc.cmax to estimate AUC between 0.25 and 12hrs.
PKNCA::pk.calc.auc(PKData$plasma.conc.n, PKData$rel.time, interval=c(0.25, 12), method="linear")
## [1] 43.33125

Chapter 3 - Sample Size and Power

Introduction to Sample Size and Power:

  • Only a sample from the target population can realistically be studied, and it needs to be appropriately sized (too big and too small are both problems)
  • Need to understand the requirements of the trial, endpoints, statistical techniques to be used, smallest clinical meaningful difference, variability, and the significance level as well as the power
  • Type I error is rejecting a true null hypothesis - significance level, often set to 0.05
  • Type II error is failing to reject a false null hypothesis - power is 1 minus Type II error and is usually targeted as 0.8 or 0.9
    • power.t.test(delta=3, sd=10, power=0.8, type = "two.sample", alternative = "two.sided")
    • power.t.test(delta=3, sd=10, power=0.9, type = "two.sample", alternative = "two.sided")
  • May want to instead run a test of proportions
    • power.prop.test(p1=0.3, p2=0.15, power=0.8)

Sample Size Adjustments:

  • The alternative hypothesis is often, but not always, based on a two-sided test
  • If the alternative hypothesis is one-sided, then this should be incorporated in the study design
    • power.t.test(delta=3, sd=10, power=0.8, type = "two.sample", alternative = "one.sided")
  • May want to have unequal group sizes (non 1:1 ratio) if it produces better recruitment or compliance
    • n.ttest(power = 0.8, alpha = 0.05, mean.diff = 3, sd1 = 10, sd2 = 10, k = 0.5, design = "unpaired", fraction = "unbalanced") # k of 0.5 means a ratio of 2
  • Can also make adjustments for unequal variances
    • n.ttest(power = 0.8, alpha = 0.05, mean.diff = 3, sd1 = 9.06, sd2 = 9.06, k = 1, design = "unpaired", fraction = "balanced")
  • There are inevitably drop-outs from clinical trials - the anticipated drop-out proportion is often called Q
    • Sample size needs to be inflated by 1 / (1 - Q)
    • orig.n <- power.t.test(delta=3, sd=10, power=0.8, type = "two.sample", alternative = "one.sided")$n
    • ceiling(orig.n/(1-0.1)) # assuming Q = 0.1
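
The inflation step generalizes to any anticipated drop-out proportion; a short sketch (the delta, sd, and Q values are illustrative only):

```r
# Inflate a per-group sample size for an anticipated drop-out proportion Q
inflate.n <- function(n, Q) ceiling(n / (1 - Q))

orig.n <- power.t.test(delta = 3, sd = 10, power = 0.8,
                       type = "two.sample", alternative = "one.sided")$n

# Per-group sizes needed under 10%, 20%, and 30% drop-out
sapply(c(0.1, 0.2, 0.3), function(Q) inflate.n(orig.n, Q))
```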

Interim Analyses and Stopping Rules:

  • Patient recruitment often occurs over a time period of years; can regularly monitor the study prior to completion
    • May want to stop early if the evidence is very strong for superiority, inferiority, side effects, futility, etc.
  • Interim analyses typically require increasing the sample size - Type I error increases with more chances to reject
  • The Pocock rule is also known as the "fixed nominal" rule
    • library(gsDesign)
    • Pocock <- gsDesign(k=3, test.type=2, sfu="Pocock") # sfu is the spending function and k=3 means 2 interim and 1 final
    • 2*(1-pnorm(Pocock$upper$bound))
    • Pocock.ss <- gsDesign(k=3, test.type=2, sfu="Pocock", n.fix=200, beta=0.1)
    • ceiling(Pocock.ss$n.I)
  • Can instead use the O'Brien-Fleming rule, which uses very stringent p-value thresholds at early looks that relax toward 0.05 at the final analysis (most of the alpha budget is saved for the full and final sample)
    • OF <- gsDesign(k=3, test.type=2, sfu="OF")
    • 2*(1-pnorm(OF$upper$bound))
    • OF.ss <- gsDesign(k=3, test.type=2, sfu="OF", n.fix=200, beta=0.1)
    • ceiling(OF.ss$n.I)
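
The inflation that motivates these spending rules can be seen directly: with k looks each tested at an unadjusted 0.05 level, the chance of at least one false rejection under the null approaches 1 - 0.95^k (a back-of-envelope sketch that treats the looks as independent, which overstates the inflation somewhat because interim analyses are correlated):

```r
# Approximate familywise Type I error with k unadjusted looks at alpha = 0.05
alpha <- 0.05
k <- 1:5
fwe <- 1 - (1 - alpha)^k
round(fwe, 3)
```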

Sample Size for Alternative Trial Designs:

  • Goal of an equivalence trial is to prove similarity to within a maximum specified delta
    • library(TOSTER)
    • powerTOSTtwo.prop(alpha = 0.05, statistical_power = 0.9, prop1 = 0.7, prop2 = 0.7, low_eqbound_prop = -0.05, high_eqbound_prop = 0.05)
    • powerTOSTtwo.raw(alpha=0.05, statistical_power=0.8, sdpooled=15, low_eqbound=-3,high_eqbound=3)
  • May want to instead run a cluster-level randomized trial
    • CRTSize::n4means(delta=1, sigma=2.5, m=25, ICC=0.1, alpha=0.05, power=0.90)
  • May want to instead run a factorial design randomized trial - powered such that it assumes independence (would not detect an interaction effect)
    • power.prop.test(p1=0.40, p2=0.25, power=0.9)
    • power.prop.test(p1=0.40, p2=0.23, power=0.9)
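
For the cluster-randomized case, the inflation over individual randomization is driven by the design effect 1 + (m - 1) * ICC; a sketch using the m = 25 and ICC = 0.1 values from the n4means call (the 100-per-arm baseline is hypothetical):

```r
# Design effect for a cluster-randomized trial
m   <- 25    # patients per cluster
ICC <- 0.1   # intracluster correlation coefficient
design.effect <- 1 + (m - 1) * ICC
design.effect
## [1] 3.4

# A trial needing 100 patients per arm under individual randomization
# would need roughly this many per arm under cluster randomization
round(100 * design.effect)
## [1] 340
```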

Example code includes:

#Generate the sample size for delta of 1, with SD of 3 and 80% power.
ss1 <- power.t.test(delta=1, sd=3, power=0.8)
ss1
## 
##      Two-sample t test power calculation 
## 
##               n = 142.2466
##           delta = 1
##              sd = 3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Round up and display the numbers needed per group
ceiling(ss1$n)
## [1] 143
#Use the sample size from above to show that it provides 80% power
power.t.test(n=ceiling(ss1$n), delta=1, sd=3)
## 
##      Two-sample t test power calculation 
## 
##               n = 143
##           delta = 1
##              sd = 3
##       sig.level = 0.05
##           power = 0.802082
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Generate a vector containing values between 0.5 and 2.0, incrementing by 0.25
delta <- seq(0.5, 2, 0.25)
npergp <- NULL

#Specify the standard deviation and power
for(i in 1:length(delta)){
  npergp[i] <- ceiling(power.t.test(delta = delta[i], sd = 3, power = 0.8)$n)
}

#Create a data frame from the deltas and sample sizes
sample.sizes <- data.frame(delta, npergp)

#Plot the patients per group against the treatment differences
ggplot(data=sample.sizes, aes(x=delta, y=npergp)) + 
    geom_line() + 
    geom_point() + 
    ggtitle("Sample Size Scenarios") + 
    xlab("Treatment Difference") + 
    ylab("Patients per Group")

#Use the power.prop.test to generate sample sizes for the proportions
power.prop.test(p1 = 0.4, p2 = 0.6, power = 0.8)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 96.92364
##              p1 = 0.4
##              p2 = 0.6
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#Find the minimum detectable percentage for the above using 150 patients per group.
power.prop.test(p1 = 0.4, power = 0.8, n = 150)$p2*100
## [1] 56.0992
#Use 90% power, delta 1.5, standard deviations of 2.5, fraction of 0.5
unequalgps <- samplesize::n.ttest(power = 0.9, alpha = 0.05, mean.diff = 1.5, 
                                  sd1 = 2.5, sd2 = 2.5, k = 0.5, 
                                  design = "unpaired", fraction = "unbalanced"
                    )
unequalgps
## $`Total sample size`
## [1] 135
## 
## $`Sample size group 1`
## [1] 90
## 
## $`Sample size group 2`
## [1] 45
## 
## $Fraction
## [1] 0.5
#Generate sample sizes comparing the proportions using a two-sided test
two.sided <- power.prop.test(p1=0.1, p2=0.3, power=0.8, alternative = "two.sided")
two.sided
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 61.5988
##              p1 = 0.1
##              p2 = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
ceiling(two.sided$n)
## [1] 62
#Repeat using a one-sided test
one.sided <- power.prop.test(p1=0.1, p2=0.3, power=0.8, alternative = "one.sided")
one.sided
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 48.40295
##              p1 = 0.1
##              p2 = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = one.sided
## 
## NOTE: n is number in *each* group
ceiling(one.sided$n)
## [1] 49
#Display the reduction per group
ceiling(two.sided$n)- ceiling(one.sided$n)
## [1] 13
#Use the gsDesign function to generate the p-values for four analyses under the Pocock rule
Pocock <- gsDesign::gsDesign(k=4, test.type=2, sfu="Pocock")
Pocock
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.296 2.36    0.0091 0.0091
##          2  0.592 2.36    0.0091 0.0067
##          3  0.887 2.36    0.0091 0.0051
##          4  1.183 2.36    0.0091 0.0041
##      Total                       0.0250 
## 
## ++ alpha spending:
##  Pocock boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2      3      4 Total   E{N}
##   0.0000 0.0091 0.0067 0.0051 0.0041 0.025 1.1561
##   3.2415 0.2748 0.3059 0.2056 0.1136 0.900 0.6975
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta      1      2      3      4 Total
##   0.0000 0.0091 0.0067 0.0051 0.0041 0.025
##   3.2415 0.0000 0.0000 0.0000 0.0000 0.000
2*(1-pnorm(Pocock$upper$bound))
## [1] 0.01821109 0.01821109 0.01821109 0.01821109
#Repeat for the O'Brien-Fleming rule
OF <- gsDesign::gsDesign(k=4, test.type=2, sfu="OF")
OF
## Symmetric two-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
## Spending computations assume trial stops
## if a bound is crossed.
## 
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1  0.256 4.05    0.0000 0.0000
##          2  0.511 2.86    0.0021 0.0021
##          3  0.767 2.34    0.0097 0.0083
##          4  1.022 2.02    0.0215 0.0145
##      Total                       0.0250 
## 
## ++ alpha spending:
##  O'Brien-Fleming boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta     1      2      3      4 Total   E{N}
##   0.0000 0.000 0.0021 0.0083 0.0145 0.025 1.0157
##   3.2415 0.008 0.2850 0.4031 0.2040 0.900 0.7674
## 
## Lower boundary (futility or Type II Error)
##           Analysis
##    Theta 1      2      3      4 Total
##   0.0000 0 0.0021 0.0083 0.0145 0.025
##   3.2415 0 0.0000 0.0000 0.0000 0.000
2*(1-pnorm(OF$upper$bound))
## [1] 5.152685e-05 4.199337e-03 1.941553e-02 4.293975e-02
#Use the gsDesign function to generate the sample sizes at each stage under the Pocock rule
Pocock.ss <- gsDesign::gsDesign(k=4, test.type=2, sfu="Pocock", n.fix=500, beta=0.1)
ceiling(Pocock.ss$n.I)
## [1] 148 296 444 592
#Repeat for the O'Brien-Fleming rule
OF.ss <- gsDesign::gsDesign(k=4, test.type=2, sfu="OF", n.fix=500, beta=0.1)
ceiling(OF.ss$n.I)
## [1] 128 256 384 512
#Find the sample size per group for expected rates of 60%, a 4% delta, 90% power, and a 5% significance level.
TOSTER::powerTOSTtwo.prop(alpha = 0.05, statistical_power = 0.9, prop1 = 0.6, prop2 = 0.6, 
                          low_eqbound_prop = -0.04, high_eqbound_prop = 0.04
                          )
## The required sample size to achieve 90 % power with equivalence bounds of -0.04 and 0.04 is 3247
## 
## [1] 3246.652
#Find the power if the above trial is limited to 2500 per group
TOSTER::powerTOSTtwo.prop(alpha = 0.05, N=2500, prop1 = 0.6, prop2 = 0.6, 
                          low_eqbound_prop = -0.04, high_eqbound_prop = 0.04
                          )
## The statistical power is 78.57 % for equivalence bounds of -0.04 and 0.04 .
## 
## [1] 0.7857316
#Find the sample size for a standard deviation of 10, delta of 2, 80% power and 5% significance level.
TOSTER::powerTOSTtwo.raw(alpha=0.05, statistical_power=0.8, sdpooled=10, low_eqbound=-2, high_eqbound=2)
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 428.1924 per group, or 858 in total.
## 
## [1] 428.1924
#Find the sample sizes based on standard deviations between 7 and 13.
stdev <- seq(7, 13, 1)
npergp <- NULL
for(i in 1:length(stdev)){
    npergp[i] <- ceiling(TOSTER::powerTOSTtwo.raw(alpha=0.05, statistical_power=0.8, sdpooled=stdev[i],
                                                  low_eqbound=-2, high_eqbound=2
                                                  )
                         )
}
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 209.8143 per group, or 420 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 274.0431 per group, or 550 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 346.8358 per group, or 694 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 428.1924 per group, or 858 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 518.1128 per group, or 1038 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 616.597 per group, or 1234 in total.
## 
## The required sample size to achieve 80 % power with equivalence bounds of -2 and 2 is 723.6451 per group, or 1448 in total.
## 
sample.sizes <- data.frame(stdev, npergp)

#Plot npergp against stdev
ggplot(data=sample.sizes, aes(x=stdev, y=npergp)) + 
    geom_line() +
    geom_point() + 
    ggtitle("Equivalence Sample Size Scenarios") +
    xlab("Standard Deviation") + 
    ylab("Patients per Group")


Chapter 4 - Statistical Analysis

Regression Analysis:

  • May be additional explanatory variables associated with outcomes of interest; these could differ by treatment arm
  • May be interested to see whether particular variables are associated with trial endpoints of particular interest
  • Can use simple linear regression to investigate the treatment effects
    • asthma.trial$group <- relevel(asthma.trial$group, ref="Placebo")
    • asthma.reg1 <- lm(fev.change ~group, asthma.trial)
    • summary(asthma.reg1)
    • t.test(fev.change~group, var.equal=TRUE, data=asthma.trial)
    • asthma.reg2<-lm(fev.change ~group + age, asthma.trial)
    • summary(asthma.reg2)
  • May want to use logistic regression to model binary outcomes
    • asthma.logreg1 <- glm(attack ~ group + age, family=binomial(link="logit"), asthma.trial)
    • summary(asthma.logreg1)
    • exp(coefficients(asthma.logreg1)[2])
    • exp(confint(asthma.logreg1)[2,])

Analysis Sets, Subgroups, and Interactions:

  • Patient adherence may be imperfect for many reasons; as such, analysis is often re-done for sub-groups
  • Intention to treat (ITT) means looking at patient outcomes according to planned treatments rather than just those who complied and completed treatment
    • Full Analysis Set (FAS) follows ITT principles and includes all randomized patients
    • Per-Protocol Set (PPS) does not follow ITT principles and instead includes only patients who adhered to the protocol
  • Can look at both FAS and PPS to compare outcomes
    • asthma.fas<-lm(fev.change ~group , asthma.trial)
    • asthma.pp<-lm(fev.change ~group , asthma.trial, subset = pp==1)
  • Can also look at subgroup analyses
    • asthma.u65<-lm( fev.change ~group , asthma.trial, subset = age<65)
    • asthma.o65<-lm( fev.change ~group , asthma.trial, subset = age>=65)
  • There is a risk of p-hacking with subgroup analyses, so they should be pre-specified rather than ad hoc
  • May also want to consider interaction effects for the trial
    • asthma.ageint <- lm(fev.change ~ group*agegroup, asthma.trial)
    • summary(asthma.ageint)

Multiplicity of Data:

  • Multiplicity arises in several ways - multiple endpoints, interim looks, sub-groups, etc.
  • Goal is to keep the probability of Type I error low - roughly 0.05
    • Multiplicity may make Type I errors cumulatively much more likely
  • Can adjust the p-values to maintain the desired Type I error rate
    • Use a significance level of alpha / n (Bonferroni) for n tests to maintain an overall Type I error rate of alpha
  • There is typically a lack of power within the subgroups due to how the study is designed
    • Subgroups should thus be limited, and based on hypotheses driven by previous research
  • There are often multiple endpoints for a study, so there needs to be clarity about which are confirmatory and which are exploratory
  • Composite endpoints can increase statistical power - for example, combining several causes of cardiac death into one endpoint
  • May also have repeated measurements due to collecting data at multiple timepoints
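
The Bonferroni adjustment above can be sketched with base R's p.adjust; the four p-values below are illustrative only.

```r
# Illustrative p-values from four subgroup tests
pvals <- c(0.038, 0.540, 0.130, 0.678)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(pvals, method = "bonferroni")

# Equivalently, compare each raw p-value against alpha / n
alpha <- 0.05
pvals < alpha / length(pvals)
```

With four tests, even the smallest raw p-value (0.038) does not clear the adjusted threshold of 0.0125.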

Wrap up:

  • Well-conducted human clinical trials are common in the medical industry and need to follow rigorous protocols and analyses
    • Randomization, including several different methods
    • Clinical trial designs such as cross-over, equivalence, and non-inferiority
    • Sample size determination
    • Statistical analysis for the overall group and sub-groups
    • T-tests, Wilcoxon rank tests, test for equal proportions, logistic regression

Example code includes:

#Explore the variable names with the str function
str(Acupuncture)
## 'data.frame':    396 obs. of  25 variables:
##  $ id                           : num  100 101 104 105 108 112 113 114 126 130 ...
##  $ age                          : num  47 52 32 53 56 45 45 49 47 46 ...
##  $ sex                          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ migraine                     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ chronicity                   : num  35 8 14 10 40 27 30 49 42 3 ...
##  $ treatment.group              : Factor w/ 2 levels "Acupuncture",..: 1 2 2 2 2 1 1 1 2 1 ...
##  $ score.baseline               : num  10.8 9.5 16 32.5 16.5 ...
##  $ score.baseline.4             : Factor w/ 4 levels "[6.75,15.2]",..: 1 1 2 3 2 1 4 3 2 3 ...
##  $ age.group                    : Factor w/ 4 levels "18-34","35-44",..: 3 3 1 3 4 3 3 3 3 3 ...
##  $ score.month3                 : num  NA NA NA 44 17.5 ...
##  $ score.month12                : num  NA NA 15.3 NA 23.2 ...
##  $ withdrawal.reason            : Factor w/ 7 levels "adverse effects",..: 5 7 NA 7 NA NA NA NA NA NA ...
##  $ completedacupuncturetreatment: num  NA NA NA NA NA 1 NA NA NA NA ...
##  $ completer                    : num  0 0 1 0 1 1 1 1 1 1 ...
##  $ total.therap.visits          : num  NA 2 0 2 0 0 7 0 0 10 ...
##  $ total.gp.visits              : num  NA 4 0 0 0 5 1 0 1 0 ...
##  $ total.spec.visits            : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ total.days.sick              : num  NA 6 3 NA 23 2 6 9 19 0 ...
##  $ diff.month12                 : num  NA NA -0.667 NA 6.75 ...
##  $ pct.month12                  : num  NA NA -4.17 NA 40.91 ...
##  $ resp35.month12               : Factor w/ 2 levels "greater than 35%",..: NA NA 2 NA 2 2 2 2 2 1 ...
##  $ any.therap.visits            : Factor w/ 2 levels "Did not visit CT",..: NA 2 1 2 1 1 2 1 1 2 ...
##  $ any.gp.visits                : Factor w/ 2 levels "Did not visit GP",..: NA 2 1 1 1 2 2 1 2 1 ...
##  $ any.spec.visits              : Factor w/ 2 levels "Did not visit specialist",..: NA 1 1 1 1 1 1 1 1 1 ...
##  $ combined                     : Factor w/ 2 levels "No visits","At least one visit": NA 2 1 2 1 2 2 1 2 2 ...
#Use the relevel function to set Control as reference group
Acupuncture$treatment.group <- relevel(Acupuncture$treatment.group, ref="Control")


#Use lm to run and store the model in linreg1
linreg1 <- lm(pct.month12 ~ treatment.group + sex + score.baseline.4, data=Acupuncture, 
              na.action = na.exclude
              )

#Display the results of linreg1
summary(linreg1)
## 
## Call:
## lm(formula = pct.month12 ~ treatment.group + sex + score.baseline.4, 
##     data = Acupuncture, na.action = na.exclude)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.71 -28.82  -4.89  24.48 130.39 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -5.805      5.460  -1.063 0.288565    
## treatment.groupAcupuncture   -16.473      4.718  -3.492 0.000555 ***
## sexMale                        2.767      6.410   0.432 0.666317    
## score.baseline.4(15.2,21.2]   -9.878      6.447  -1.532 0.126584    
## score.baseline.4(21.2,34.6]  -18.822      6.717  -2.802 0.005415 ** 
## score.baseline.4(34.6,94.8]  -15.786      6.657  -2.371 0.018375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.23 on 290 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.06742,    Adjusted R-squared:  0.05134 
## F-statistic: 4.193 on 5 and 290 DF,  p-value: 0.001076
#Use lm to run and store the model in linreg2
linreg2 <- lm(pct.month12 ~ treatment.group + score.baseline.4, data=Acupuncture, na.action = na.exclude)

#Display the results of linreg2
summary(linreg2)
## 
## Call:
## lm(formula = pct.month12 ~ treatment.group + score.baseline.4, 
##     data = Acupuncture, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.371 -29.273  -4.979  24.691 129.902 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -5.435      5.385  -1.009 0.313616    
## treatment.groupAcupuncture   -16.359      4.704  -3.478 0.000582 ***
## score.baseline.4(15.2,21.2]   -9.824      6.437  -1.526 0.128053    
## score.baseline.4(21.2,34.6]  -18.850      6.707  -2.811 0.005281 ** 
## score.baseline.4(34.6,94.8]  -15.783      6.647  -2.374 0.018234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.18 on 291 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.06682,    Adjusted R-squared:  0.05399 
## F-statistic: 5.209 on 4 and 291 DF,  p-value: 0.0004574
#Add the predicted values to the Acupuncture dataset for linreg2 using the predict function 
Acupuncture$pred.linreg2 <- predict(linreg2)


#Plot the predicted values against baseline score quartile grouping by treatment.
ggplot(data = subset(Acupuncture, !is.na(pred.linreg2)), 
       aes(x = score.baseline.4, y = pred.linreg2, group = treatment.group)
       ) + 
    geom_line(aes(color = treatment.group)) +
    geom_point(aes(color = treatment.group)) + 
    ggtitle("Predicted Values from Linear Regression Model") + 
    xlab("Baseline Score Quartile") + 
    ylab("Percentage Change from Baseline at M12")

#Use the relevel function to set "Control" as the reference for treatment
Acupuncture$treatment.group <- relevel(Acupuncture$treatment.group, ref="Control")

#Use the relevel function to set "less than or eq to 35%" as the reference for resp35.month12
Acupuncture$resp35.month12 <- relevel(Acupuncture$resp35.month12, ref="less than or eq to 35%")

#Use glm to run and store the model in logreg1
logreg1 <- glm(resp35.month12 ~ treatment.group + sex  + score.baseline.4, 
               family=binomial(link="logit"), data=Acupuncture
               )

#Display the results of logreg1
summary(logreg1)
## 
## Call:
## glm(formula = resp35.month12 ~ treatment.group + sex + score.baseline.4, 
##     family = binomial(link = "logit"), data = Acupuncture)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4033  -0.9907  -0.7770   1.1163   1.7695  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.0430     0.2935  -3.553  0.00038 ***
## treatment.groupAcupuncture    0.9734     0.2465   3.949 7.84e-05 ***
## sexMale                      -0.2882     0.3340  -0.863  0.38818    
## score.baseline.4(15.2,21.2]   0.5463     0.3359   1.626  0.10386    
## score.baseline.4(21.2,34.6]   0.5865     0.3487   1.682  0.09258 .  
## score.baseline.4(34.6,94.8]   0.2151     0.3481   0.618  0.53664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 406.88  on 295  degrees of freedom
## Residual deviance: 386.90  on 290  degrees of freedom
##   (100 observations deleted due to missingness)
## AIC: 398.9
## 
## Number of Fisher Scoring iterations: 4
#Display the odds ratio and 95% CI for Acupuncture vs Control
exp(coefficients(logreg1)[2])
## treatment.groupAcupuncture 
##                   2.646798
exp(confint(logreg1)[2,])
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 1.640863 4.318076
#Tabulate withdrawal.reason
table(Acupuncture$withdrawal.reason,  useNA="ifany")
## 
##       adverse effects                  died  intercurrent illness 
##                     1                     1                    16 
##     lost to follow-up      treatment hassle treatment ineffective 
##                    15                     5                     4 
##      withdrew consent                  <NA> 
##                    58                   296
#Tabulate completedacupuncturetreatment by treatment.group
table(Acupuncture$completedacupuncturetreatment, Acupuncture$treatment.group,  useNA="ifany") 
##       
##        Control Acupuncture
##   0          0          35
##   1          1         131
##   <NA>     193          36
#Create a per protocol flag that is TRUE if patients met the criteria
Acupuncture <- Acupuncture %>%
    mutate(pp = is.na(withdrawal.reason) & 
               ((completedacupuncturetreatment==1 & treatment.group=="Acupuncture") | 
                    (is.na(completedacupuncturetreatment) & treatment.group=="Control")
                )
           )
Acupuncture$pp[is.na(Acupuncture$pp)] <- FALSE
Acupuncture$pp <- as.factor(Acupuncture$pp)

#Tabulate the per protocol flag
table(Acupuncture$pp)
## 
## FALSE  TRUE 
##   144   252
#Use the relevel function to set Control as reference group
Acupuncture$treatment.group <- relevel(Acupuncture$treatment.group, ref="Control")

#Use lm to run and store the model without interaction in linregnoint
linregnoint <- lm(pct.month12 ~ treatment.group + score.baseline.4, Acupuncture, na.action = na.exclude)

#Use lm to run and store the model with interaction in linregint
linregint <- lm(pct.month12 ~ treatment.group*score.baseline.4, Acupuncture, na.action = na.exclude)

#Display the results of linregnoint and linregint
summary(linregnoint)
## 
## Call:
## lm(formula = pct.month12 ~ treatment.group + score.baseline.4, 
##     data = Acupuncture, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.371 -29.273  -4.979  24.691 129.902 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -5.435      5.385  -1.009 0.313616    
## treatment.groupAcupuncture   -16.359      4.704  -3.478 0.000582 ***
## score.baseline.4(15.2,21.2]   -9.824      6.437  -1.526 0.128053    
## score.baseline.4(21.2,34.6]  -18.850      6.707  -2.811 0.005281 ** 
## score.baseline.4(34.6,94.8]  -15.783      6.647  -2.374 0.018234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.18 on 291 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.06682,    Adjusted R-squared:  0.05399 
## F-statistic: 5.209 on 4 and 291 DF,  p-value: 0.0004574
summary(linregint)
## 
## Call:
## lm(formula = pct.month12 ~ treatment.group * score.baseline.4, 
##     data = Acupuncture, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.497 -28.562  -5.565  26.525 126.990 
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                              -9.662      7.239
## treatment.groupAcupuncture                               -9.220      9.407
## score.baseline.4(15.2,21.2]                              -4.471      9.543
## score.baseline.4(21.2,34.6]                             -10.655     10.412
## score.baseline.4(34.6,94.8]                             -12.425      9.875
## treatment.groupAcupuncture:score.baseline.4(15.2,21.2]   -9.478     12.993
## treatment.groupAcupuncture:score.baseline.4(21.2,34.6]  -14.057     13.644
## treatment.groupAcupuncture:score.baseline.4(34.6,94.8]   -5.375     13.417
##                                                        t value Pr(>|t|)
## (Intercept)                                             -1.335    0.183
## treatment.groupAcupuncture                              -0.980    0.328
## score.baseline.4(15.2,21.2]                             -0.469    0.640
## score.baseline.4(21.2,34.6]                             -1.023    0.307
## score.baseline.4(34.6,94.8]                             -1.258    0.209
## treatment.groupAcupuncture:score.baseline.4(15.2,21.2]  -0.729    0.466
## treatment.groupAcupuncture:score.baseline.4(21.2,34.6]  -1.030    0.304
## treatment.groupAcupuncture:score.baseline.4(34.6,94.8]  -0.401    0.689
## 
## Residual standard error: 40.3 on 288 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.07059,    Adjusted R-squared:  0.048 
## F-statistic: 3.125 on 7 and 288 DF,  p-value: 0.003398
#Compare the models with the anova command
anova(linregnoint, linregint)
## Analysis of Variance Table
## 
## Model 1: pct.month12 ~ treatment.group + score.baseline.4
## Model 2: pct.month12 ~ treatment.group * score.baseline.4
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    291 469699                           
## 2    288 467802  3    1897.3 0.3893 0.7608
#Tabulate the age group variable to view the categories
table(Acupuncture$age.group)
## 
## 18-34 35-44 45-54 55-65 
##    75    83   159    79
#Display the adjusted significance level
0.05/4
## [1] 0.0125
#Run the Wilcoxon Rank Sum test in each of the age subgroups
age <- c("18-34", "35-44", "45-54", "55-65")
for(group in age){
  subgroup <- broom::tidy(wilcox.test(total.days.sick ~ treatment.group, 
                                      data = subset(Acupuncture, age.group==group), 
                                      exact=FALSE
                                      )
                          )
  print(group)
  print(subgroup)
}
## [1] "18-34"
## # A tibble: 1 x 4
##   statistic p.value method                                      alternative
##       <dbl>   <dbl> <chr>                                       <chr>      
## 1       518  0.0380 Wilcoxon rank sum test with continuity cor~ two.sided  
## [1] "35-44"
## # A tibble: 1 x 4
##   statistic p.value method                                      alternative
##       <dbl>   <dbl> <chr>                                       <chr>      
## 1       534   0.540 Wilcoxon rank sum test with continuity cor~ two.sided  
## [1] "45-54"
## # A tibble: 1 x 4
##   statistic p.value method                                      alternative
##       <dbl>   <dbl> <chr>                                       <chr>      
## 1      2557   0.130 Wilcoxon rank sum test with continuity cor~ two.sided  
## [1] "55-65"
## # A tibble: 1 x 4
##   statistic p.value method                                      alternative
##       <dbl>   <dbl> <chr>                                       <chr>      
## 1       630   0.678 Wilcoxon rank sum test with continuity cor~ two.sided
#Tabulate the combined endpoint by treatment group
table(Acupuncture$combined, Acupuncture$treatment.group, useNA="ifany")
##                     
##                      Control Acupuncture
##   No visits               53          65
##   At least one visit     102         109
##   <NA>                    39          28
#Use the relevel function to set Control as reference group
Acupuncture$treatment.group <- relevel(Acupuncture$treatment.group, ref="Control")

#Use compareGroups to generate and save the treatment effect for the composite endpoint and each component
combined.test <- compareGroups::compareGroups(treatment.group ~ combined + any.therap.visits + any.gp.visits + any.spec.visits, data = Acupuncture)

# Store the results in a table
combined.test.table <- compareGroups::createTable(combined.test, show.ratio = TRUE, show.p.overall=FALSE)

#Display the results
combined.test.table
## 
## --------Summary descriptives table by 'treatment.group'---------
## 
## _____________________________________________________________________________ 
##                                Control   Acupuncture        OR        p.ratio 
##                                 N=155       N=174                             
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## combined:                                                                     
##     No visits                53 (34.2%)  65 (37.4%)        Ref.        Ref.   
##     At least one visit       102 (65.8%) 109 (62.6%) 0.87 [0.55;1.37]  0.553  
## any.therap.visits:                                                            
##     Did not visit CT         126 (81.3%) 142 (81.6%)       Ref.        Ref.   
##     Visited CT               29 (18.7%)  32 (18.4%)  0.98 [0.56;1.72]  0.940  
## any.gp.visits:                                                                
##     Did not visit GP         63 (40.6%)  85 (48.9%)        Ref.        Ref.   
##     Visited GP               92 (59.4%)  89 (51.1%)  0.72 [0.46;1.11]  0.138  
## any.spec.visits:                                                              
##     Did not visit specialist 145 (93.5%) 160 (92.0%)       Ref.        Ref.   
##     Visited specialist       10 (6.45%)  14 (8.05%)  1.26 [0.54;3.04]  0.590  
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Financial Analytics in R

Chapter 1 - Intro to Valuations (Cash is King)

Valuations Overview:

  • Can estimate economic value based on cash flows using R and the tidyverse
  • Focus for the course is valuation of projects within a company
  • Discounted cash flows use an interest rate to discount future expenditures and revenues to the present value
  • Cash flow statements typically have time as the column, with variables such as spend or revenue as rows
    • May want to tidy data for analysis where the column is the variable and the row is the observation
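
Discounting to present value can be sketched directly in base R; the cash flows and 10% rate below are assumptions for illustration.

```r
# Assumed future cash flows for years 1-5 and an assumed 10% discount rate
cashflows <- c(3000, 3500, 3100, 1750, 875)
rate <- 0.10
years <- seq_along(cashflows)

# Present value of each cash flow, then their sum (the discounted value)
pv <- cashflows / (1 + rate)^years
sum(pv)
```

Later cash flows are discounted more heavily, so the year-5 flow contributes far less to the total than the year-1 flow.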

Business Models and Writing R Functions:

  • The business model includes the timing of earning revenue and incurring expenses
    • Operating revenue is driven by the units sold and the price per unit (for consumer products)
    • Operating revenue is driven by enrolment, churn, monthly fees, etc. (for subscription products)
    • Direct Expenses are driven by the units sold and the cost per unit (marginal direct costs)
    • Operating expenses (overhead) are costs required to run the business but not directly tied to a specific sale - e.g., SG&A, depreciation, etc.
  • The timing of revenues and expenses is on an accrual basis - for example, raw materials may be purchased well before production, but the expense accrues when the goods are produced
  • Gross Profit = Operating Revenue minus Direct Expenses
  • Can use a function for calculating the business model
    • calc_business_model <- function(assumptions, price_per_unit, cost_per_unit){
    •     model <- assumptions
    •     model$revenue <- model$sales * price_per_unit
    •     model$direct_expense <- model$sales * cost_per_unit
    •     model$gross_profit <- model$revenue - model$direct_expense
    •     model
    • }
    • calc_business_model(assumptions, price_per_unit = 10, cost_per_unit = 2)

Pro-Forma Income Statements:

  • The income statement (P&L) translates revenues and expenses to net income
    • overhead_expense <- sga + depreciation + amortization
    • operating_profit <- gross_profit - overhead_expense
  • Depreciation is the idea that large capital expenditures should be spread over their useful lifetimes
    • Amortization is a similar concept for non-physical investments (such as R&D)
    • Depreciation per Year = (Book Value - Salvage Value) / Years of Useful Life
    • depreciation_per_period <- (book_value - salvage_value)/useful_life
    • depreciation <- rep(depreciation_per_period, useful_life)
  • Can calculate valuations as either levered (financing, including tax considerations and interest) or unlevered (ignores financing considerations)
  • Generally need to include the tax considerations
    • tax <- operating_income * tax_rate
    • net_income <- operating_income - tax
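
The depreciation and tax formulas above can be combined into a small numeric sketch; all figures (book value, salvage, operating income, tax rate) are assumptions for illustration.

```r
# Assumed capital purchase: book value 10000, salvage 1000, 5-year useful life
book_value <- 10000
salvage_value <- 1000
useful_life <- 5

# Straight-line depreciation, spread evenly over the useful life
depreciation_per_period <- (book_value - salvage_value) / useful_life
depreciation <- rep(depreciation_per_period, useful_life)
depreciation

# Taxes reduce operating income to net income (assumed 30% rate)
operating_income <- 5000
tax_rate <- 0.30
net_income <- operating_income - operating_income * tax_rate
net_income
```

Here each of the five years carries the same 1800 depreciation charge, and net income is operating income less the 30% tax.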

Income to Cash:

  • Net income needs to be adjusted to create free cash flow
  • Cash is king - need to be able to convert the income into no-strings-attached cash
  • Income is recognized based on accrual while cash is recognized when it is received or spent
    • The income statement may differ substantially from the cash flow, particularly due to depreciation
  • To convert from income to free cash flow, convert as follows
    • Add back depreciation and amortization (only care about actual spending)
    • Subtract out CAPEX (care about recognizing all of the capital asset when it is purchased)
    • Adjust for net working capital (NWC)
    • cashflow <- net_income + depreciation_exp - capex + nwc_changes
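
The income-to-cash conversion above can be sketched with a few assumed single-period figures:

```r
# Assumed single-period figures for the income-to-cash conversion
net_income <- 3500
depreciation_exp <- 1800   # added back: not an actual cash outlay
capex <- 2000              # subtracted: cash spent on capital assets
nwc_changes <- -300        # adjustment for the change in net working capital

# Free cash flow per the formula above
cashflow <- net_income + depreciation_exp - capex + nwc_changes
cashflow
```

Note that free cash flow can end up either above or below net income depending on how large CAPEX and working-capital changes are relative to depreciation.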

Example code includes:

# define inputs
price <- 20
print_cost <- 0.5
ship_cost <- 2

assumptions <- data.frame(year=1:5, sales=c(175, 200, 180, 100, 50))

# add revenue, expense, and profit variables
cashflow <- assumptions
cashflow$revenue <- cashflow$sales * price
cashflow$direct_expense <- cashflow$sales * (print_cost + ship_cost) 
cashflow$gross_profit <- cashflow$revenue - cashflow$direct_expense

# print cashflow
print(cashflow)
##   year sales revenue direct_expense gross_profit
## 1    1   175    3500          437.5       3062.5
## 2    2   200    4000          500.0       3500.0
## 3    3   180    3600          450.0       3150.0
## 4    4   100    2000          250.0       1750.0
## 5    5    50    1000          125.0        875.0
prem_ts <- data.frame(MONTH=1:60, COST_PER_SONG=0.01, SONG_LENGTH=3, REV_PER_SUB=10) %>%
    mutate(PCT_ACTIVE=0.95**(MONTH-1), HOURS_PER_MONTH=528-0.08*(MONTH-40)**2)

# premium business models
premium_model <- prem_ts
premium_model$SONGS_PLAYED <- prem_ts$PCT_ACTIVE * prem_ts$HOURS_PER_MONTH * 1 / prem_ts$SONG_LENGTH
premium_model$REV_SUBSCRIPTION <- prem_ts$PCT_ACTIVE * prem_ts$REV_PER_SUB
premium_model$COST_SONG_PLAYED <- premium_model$SONGS_PLAYED * prem_ts$COST_PER_SONG

# inspect results
head(premium_model)
##   MONTH COST_PER_SONG SONG_LENGTH REV_PER_SUB PCT_ACTIVE HOURS_PER_MONTH
## 1     1          0.01           3          10  1.0000000          406.32
## 2     2          0.01           3          10  0.9500000          412.48
## 3     3          0.01           3          10  0.9025000          418.48
## 4     4          0.01           3          10  0.8573750          424.32
## 5     5          0.01           3          10  0.8145062          430.00
## 6     6          0.01           3          10  0.7737809          435.52
##   SONGS_PLAYED REV_SUBSCRIPTION COST_SONG_PLAYED
## 1     135.4400        10.000000         1.354400
## 2     130.6187         9.500000         1.306187
## 3     125.8927         9.025000         1.258927
## 4     121.2671         8.573750         1.212671
## 5     116.7459         8.145062         1.167459
## 6     112.3324         7.737809         1.123324
free_ts <- data.frame(MONTH=1:60, PROP_MUSIC=0.95, REV_PER_AD=0.02, REV_PER_CLICK=10, 
                      COST_PER_SONG=0.01, SONG_LENGTH=3, AD_LENGTH=0.25, CLICK_THROUGH_RATE=0.001
                      ) %>%
    mutate(PCT_ACTIVE=0.97**(MONTH-1), HOURS_PER_MONTH=480-0.08*(MONTH-40)**2)

# freemium business models
freemium_model <- free_ts
freemium_model$SONGS_PLAYED <- free_ts$PCT_ACTIVE * free_ts$HOURS_PER_MONTH * free_ts$PROP_MUSIC / free_ts$SONG_LENGTH
freemium_model$ADS_PLAYED <- free_ts$PCT_ACTIVE * free_ts$HOURS_PER_MONTH * (1-free_ts$PROP_MUSIC) / free_ts$AD_LENGTH
freemium_model$REV_AD_PLAYED <- freemium_model$ADS_PLAYED * free_ts$REV_PER_AD
freemium_model$REV_AD_CLICKED <- freemium_model$ADS_PLAYED * free_ts$CLICK_THROUGH_RATE * free_ts$REV_PER_CLICK
freemium_model$COST_SONG_PLAYED <- freemium_model$SONGS_PLAYED * free_ts$COST_PER_SONG

# examine output
head(freemium_model)
##   MONTH PROP_MUSIC REV_PER_AD REV_PER_CLICK COST_PER_SONG SONG_LENGTH
## 1     1       0.95       0.02            10          0.01           3
## 2     2       0.95       0.02            10          0.01           3
## 3     3       0.95       0.02            10          0.01           3
## 4     4       0.95       0.02            10          0.01           3
## 5     5       0.95       0.02            10          0.01           3
## 6     6       0.95       0.02            10          0.01           3
##   AD_LENGTH CLICK_THROUGH_RATE PCT_ACTIVE HOURS_PER_MONTH SONGS_PLAYED
## 1      0.25              0.001  1.0000000          358.32     113.4680
## 2      0.25              0.001  0.9700000          364.48     111.9561
## 3      0.25              0.001  0.9409000          370.48     110.3851
## 4      0.25              0.001  0.9126730          376.32     108.7614
## 5      0.25              0.001  0.8852928          382.00     107.0909
## 6      0.25              0.001  0.8587340          387.52     105.3793
##   ADS_PLAYED REV_AD_PLAYED REV_AD_CLICKED COST_SONG_PLAYED
## 1   71.66400      1.433280      0.7166400         1.134680
## 2   70.70912      1.414182      0.7070912         1.119561
## 3   69.71693      1.394339      0.6971693         1.103851
## 4   68.69142      1.373828      0.6869142         1.087614
## 5   67.63637      1.352727      0.6763637         1.070909
## 6   66.55532      1.331106      0.6655532         1.053793
# Define function: calc_business_model
calc_business_model <- function(assumptions, price, print_cost, ship_cost){
    cashflow <- assumptions
    cashflow$revenue <- cashflow$sales * price
    cashflow$direct_expense <- cashflow$sales * (print_cost + ship_cost) 
    cashflow$gross_profit <- cashflow$revenue - cashflow$direct_expense
    cashflow
}

# Call calc_business_model function for different sales prices
assumptions
##   year sales
## 1    1   175
## 2    2   200
## 3    3   180
## 4    4   100
## 5    5    50
calc_business_model(assumptions, 20, 0.5, 2)$gross_profit
## [1] 3062.5 3500.0 3150.0 1750.0  875.0
calc_business_model(assumptions, 25, 0.5, 2)$gross_profit
## [1] 3937.5 4500.0 4050.0 2250.0 1125.0
# Inputs
production <- data.frame(Month=1:60, Units=rep(c(60, 50, 40, 30), times=15))

cost <- 100000
life <- 60
salvage <- 10000

# Compute depreciation
production$Depr_Straight <- (cost - salvage)/life
production$Depr_UnitsProd <- (cost - salvage)*(production$Units) / sum(production$Units)

# Plot two depreciation schedules
ggplot(production, aes(x = Month)) + 
    geom_line(aes(y = Depr_Straight)) + 
    geom_line(aes(y = Depr_UnitsProd))

# Business model
cashflow
##   year sales revenue direct_expense gross_profit
## 1    1   175    3500          437.5       3062.5
## 2    2   200    4000          500.0       3500.0
## 3    3   180    3600          450.0       3150.0
## 4    4   100    2000          250.0       1750.0
## 5    5    50    1000          125.0        875.0
cashflow$revenue <- cashflow$revenue + 2 * cashflow$sales
cashflow$gross_profit <- cashflow$revenue - cashflow$direct_expense

# Income statement
cashflow$depr_sl <- (1000 - 0) / 5
cashflow$operating_profit <- cashflow$gross_profit - cashflow$depr_sl
cashflow$tax <- cashflow$operating_profit * 0.3
cashflow$net_income <- cashflow$operating_profit - cashflow$tax

# Inspect dataset
cashflow
##   year sales revenue direct_expense gross_profit depr_sl operating_profit
## 1    1   175    3850          437.5       3412.5     200           3212.5
## 2    2   200    4400          500.0       3900.0     200           3700.0
## 3    3   180    3960          450.0       3510.0     200           3310.0
## 4    4   100    2200          250.0       1950.0     200           1750.0
## 5    5    50    1100          125.0        975.0     200            775.0
##       tax net_income
## 1  963.75    2248.75
## 2 1110.00    2590.00
## 3  993.00    2317.00
## 4  525.00    1225.00
## 5  232.50     542.50
# Calculate income statement
assumptions <- data.frame(unit_sales=100000*c(1, 2, 4, 8), machines_purchased=c(1, 1, 2, 4), 
                          depreciation=10000000*c(4, 8, 16, 32)
                          )
assumptions
##   unit_sales machines_purchased depreciation
## 1      1e+05                  1      4.0e+07
## 2      2e+05                  1      8.0e+07
## 3      4e+05                  2      1.6e+08
## 4      8e+05                  4      3.2e+08
price_per_unit <- 1000
cogs_per_unit <- 450
labor_per_unit <- 50

income_statement <- assumptions
income_statement$revenue <- income_statement$unit_sales * price_per_unit
income_statement$expenses <- income_statement$unit_sales * (cogs_per_unit + labor_per_unit)
income_statement$earnings <- income_statement$revenue - income_statement$expenses - income_statement$depreciation

# Summarize cumulative earnings
sum(income_statement$earnings)
## [1] 1.5e+08
sum(income_statement$earnings) / sum(income_statement$revenue)
## [1] 0.1
# calculate free cashflow
cashflow <- income_statement
cashflow$operating_cf <- cashflow$earnings + cashflow$depreciation
cashflow$capex <- cashflow$machines_purchased * 160000000
cashflow$free_cf <- cashflow$operating_cf - cashflow$capex

# summarize free cashflow
sum(cashflow$free_cf)
## [1] -5.3e+08

Chapter 2 - Key Financial Concepts (Time is Money)

Time Value of Money:

  • Money today is generally more valuable than money tomorrow
    • Can use a compounding interest rate, so V(n) = V(0) * (1 + r)**n
    • Therefore, PV = FV(n) / (1+r)**n - the discounting rate is r, applied over n time periods
    • mutate(data, pv = fv / (1 + r)^n)

Different Discount Rates:

  • Need to be clear about the type of discount rate - it must match the time periods of the cash flows
  • Can convert discount rates between time periods using appropriate compounding
    • r2 = [(1 + r1)^(# r1 units per 1 r2 unit) ] - 1
    • r_quart <- (1 + r_mth)^3 - 1
    • r_quart <- (1 + r_ann)^(1/4) - 1
  • Need to also consider the differences in real and nominal rates
    • Can think about inflation or deflation as it relates to purchasing power for an item
  • Generally easiest for cash flows to reflect real cash-flows and discounted by real interest rates
    • r_real <- (1 + r_nominal) / (1 + inflation_rate) - 1
    • r_nominal <- (1 + r_real) * (1 + inflation_rate) - 1

Discounting Multiple Cash Flows:

  • Cash flow differs dramatically based on time in the future when it is received
  • Need to discount the cash flows back to present values prior to summing across them
    • pv <- calc_pv(fv = 100, r = 0.01, n = 3) # single cash flow
    • cashflows <- c(0, -50, 25, 100, 175, 250, 250)
    • discounted_cashflows <- calc_pv(cashflows, r = 0.01, n = 0:6) # vectorized calculation, once for each of 0:6
  • Vectorized functions work well with the Tidyverse
    • many_cashflows %>% group_by(option) %>% summarize( PV = sum(calc_pv(cashflow, 0.08, n = time)) )

Example code includes:

# Assign input variables
fv <- 100
r <- 0.08

# Calculate PV if receive FV in 1 year
pv_1 <- 100 / (1 + r)**1
pv_1
## [1] 92.59259
# Calculate PV if receive FV in 5 years
pv_5 <- 100 / (1 + r)**5
pv_5
## [1] 68.05832
# Calculate difference
pv_1 - pv_5
## [1] 24.53427
# Define PV function: calc_pv
calc_pv <- function(fv, r, n){
    pv <- fv / (1+r)**n
    pv
}

# Use PV function for 1 input
calc_pv(100, 0.08, 5)
## [1] 68.05832
# Use PV function for range of inputs
n_range <- 1:10
pv_range <- calc_pv(100, 0.08, n_range)
pv_range
##  [1] 92.59259 85.73388 79.38322 73.50299 68.05832 63.01696 58.34904
##  [8] 54.02689 50.02490 46.31935
# Calculate present values in dataframe
present_values <- data.frame(n = 1:10) %>% mutate(pv = 100 / (1 + 0.08)**n)

# Plot relationship between time periods versus present value
ggplot(present_values, aes(x = n, y = pv)) +
    geom_line() +
    geom_label(aes(label = paste0("$",round(pv,0)))) +
    ylim(0,100) +
    labs(
        title = "Discounted Value of $100 by Year Received", 
        x = "Number of Years in the Future",
        y = "Present Value ($)"
        )

# Calculate present values over range of time periods and discount rates
present_values <- 
    expand.grid(n = 1:10, r = seq(0.05,0.12,0.01)) %>%
    mutate(pv = calc_pv(100, r, n))
     
# Plot present value versus time delay with a separate colored line for each rate
ggplot(present_values, aes(x = n, y = pv, col = factor(r))) +
    geom_line() +
    ylim(0,100) +
    labs(
        title = "Discounted Value of $100 by Year Received", 
        x = "Number of Years in the Future",
        y = "Present Value ($)",
        col = "Discount Rate"
        )

# Convert monthly to other time periods
r1_mth <- 0.005
r1_quart <- (1 + r1_mth)^3 - 1
r1_semi <- (1 + r1_mth)^6 - 1
r1_ann <- (1 + r1_mth)^12 - 1

# Convert years to other time periods
r2_ann <- 0.08
r2_mth <- (1 + r2_ann)^(1/12) - 1
r2_quart <- (1 + r2_ann)^(1/4) - 1


# Convert real to nominal
r1_real <- 0.08
inflation1 <- 0.03
(r1_nom <- (1 + r1_real) * (1 + inflation1) - 1)
## [1] 0.1124
# Convert nominal to real
r2_nom <- 0.2
inflation2 <- 0.05
(r2_real <- (1 + r2_nom) / (1 + inflation2) - 1)
## [1] 0.1428571
# Define cashflows
cashflow_a <- c(5000, rep(0,6))
cashflow_b <- c(0, rep(1000, 6))

# Calculate pv for each time period
disc_cashflow_a <- calc_pv(cashflow_a, 0.06, 0:6)
disc_cashflow_b <- calc_pv(cashflow_b, 0.06, 0:6)

# Calculate and report total present value for each option
(pv_a <- sum(disc_cashflow_a))
## [1] 5000
(pv_b <- sum(disc_cashflow_b))
## [1] 4917.324
# Define cashflows
cashflow_old <- rep(-500, 11)
cashflow_new <- c(-2200, rep(-300, 10))
options <- data.frame(time = rep(0:10, 2),
                      option = c(rep("Old",11), rep("New",11)),
                      cashflow = c(cashflow_old, cashflow_new)
                      )
                
# Calculate total expenditure with and without discounting
options %>%
    group_by(option) %>%
    summarize(sum_cashflow = sum(cashflow),
              sum_disc_cashflow = sum(calc_pv(cashflow, 0.12, 0:10))
              )
## # A tibble: 2 x 3
##   option sum_cashflow sum_disc_cashflow
##   <fct>         <dbl>             <dbl>
## 1 New           -5200            -3895.
## 2 Old           -5500            -3325.

Chapter 3 - Prioritizing Profitability (Financial Metrics)

Profitability Metrics and Payback Period:

  • Profitability metrics help to quantify how projects bring value to firms
  • Decision rules help to interpret metrics for decisions and comparisons
    • Absolute vs. relative
    • Constrained vs. unconstrained
  • All summary metrics have shortcomings; none are perfect in all situations
    • Decision rules are communication tools, not hard and fast “laws”
    • There are many corporate objectives over and above discounted cash flows; DCF is an important input but not the sole answer
  • Payback periods help to assess the amount of time needed before full payback - sooner is better
    • Does not consider TVM and does not consider profits after the payback period
    • cashflows <- c(-10000, 2500, 3000, 5000, 6000, 1000)
    • cumsum(cashflows) # the initial investment is already included as the first element

NPV, IRR, Profitability Index:

  • NPV - Net Present Value is net of the investment costs
    • In an unconstrained world, any NPV-positive project would be pursued
    • n <- 0:(length(cashflows) - 1)
    • npv <- sum( calc_pv(cashflows, r, n) )
  • IRR - Internal Rate of Return is the required rate of return for an investment to break-even over a stated time period
    • Also called the “hurdle rate” since it is the rate required for a project to hurdle into profitability
    • assume we have calc_npv function with signature:

    • calc_npv(cashflows, r)

    • uniroot(calc_npv, interval = c(0, 1), cashflows = cashflows)$root
  • Profitability index - ratio between sum of discounted cash flows over the cost of the initial investment
    • A value of 1 is a break-even that only recoups the investment
    • npv_fcf <- calc_npv(future_cashflow, r)
    • profitability_index <- npv_fcf / abs(initial_investment)

Terminal Value:

  • Terminal Value (TV) is residual value outside the specified time period of the cash flow analysis
    • Can add TV as a summary metric for the cash flows (discounted) after the final time period
    • final_cashflow <- cashflow[n]
    • terminal_value_period_n <- final_cashflow / (discount_rate - growth_rate)
    • terminal_value_as_present <- terminal_value_period_n / (1 + discount_rate)^n
  • Can also use the exit multiplier method, in which a benchmark is applied to the cash flows (sales, revenues, etc.)
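The exit multiplier method mentioned above can be sketched with a hypothetical benchmark; the multiple and cashflow figures below are illustrative assumptions, not values from the course:

```r
# Exit multiple method: apply a benchmark multiple (here, assumed 2.5x
# revenue from comparable firms) to the final forecast year, then discount
final_year_revenue <- 50000
exit_multiple <- 2.5      # assumed benchmark revenue multiple
discount_rate <- 0.12
n <- 10                   # final forecast period

terminal_value_period_n <- final_year_revenue * exit_multiple
terminal_value_as_present <- terminal_value_period_n / (1 + discount_rate)^n
terminal_value_as_present
```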

Comparing and Computing Metrics:

  • NPV and IRR are important metrics with related information - however, they may give different constrained project decisions
    • By definition, IRR is the discount rate that lets a project break-even on an NPV basis
  • NPV focuses on profit (favors large investments), while IRR focuses on return (favors small investments with high returns)
    • Best solution can be to examine both metrics
    • options %>% group_by(option) %>% summarize(npv=calc_npv(cf,0.08))

Wrap up:

  • Strengths, weaknesses, and blind spots of various metrics
    • Summary metrics alone can be deceptive
  • Additional metrics include ROE (equity) and ROA (assets)
  • Industries and even companies may have their own key metrics of interest
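ROE and ROA are simple ratios; a minimal sketch with hypothetical balance-sheet figures (not from the course):

```r
# Hypothetical figures - ROE and ROA are ratios of net income to
# shareholder equity and to total assets, respectively
net_income <- 2500
total_equity <- 20000
total_assets <- 50000

roe <- net_income / total_equity   # return on equity
roa <- net_income / total_assets   # return on assets
c(ROE = roe, ROA = roa)
```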

Example code includes:

cashflows <- c(-50000, 1000, 5000, 5000, 5000, 10000, 10000, 10000, 10000, 10000, 10000)

# Inspect variables
cashflows
##  [1] -50000   1000   5000   5000   5000  10000  10000  10000  10000  10000
## [11]  10000
# Calculate cumulative cashflows
cum_cashflows <- cumsum(cashflows)

# Identify payback period
payback_period <- min(which(cum_cashflows >= 0)) - 1

# View result
payback_period
## [1] 8
# Define payback function: calc_payback
calc_payback <- function(cashflows) {
  cum_cashflows <- cumsum(cashflows)
  payback_period <- min(which(cum_cashflows >= 0)) - 1
  payback_period
}

# Test out our function
cashflows <- c(-100, 50, 50, 50)
calc_payback(cashflows) == 2
## [1] TRUE
cashflows <- c(-50000, 1000, 5000, 5000, 5000, 10000, 10000, 10000, 10000, 10000, 10000)

# normal payback period
payback_period <- calc_payback(cashflows)

# discounted payback period
discounted_cashflows <- calc_pv(cashflows, r = 0.06, n = 0:(length(cashflows)-1) )
payback_period_disc <- calc_payback(discounted_cashflows)

# compare results
payback_period
## [1] 8
payback_period_disc
## [1] 10
# Define NPV function: calc_npv
calc_npv <- function(cashflows, r) {
  n <- 0:(length(cashflows) - 1)
  npv <- sum( calc_pv(cashflows, r, n) )
  npv
}


# The base R function stats::uniroot can help find values between a lower bound (lower) and an upper bound (upper) where the value of a function is zero
# This can help us calculate the internal rate of return (IRR) for which NPV = 0.

# Define IRR function: calc_irr
calc_irr <- function(cashflows) {
    uniroot(calc_npv, 
        interval = c(0, 1), 
        cashflows = cashflows)$root
}

# Try out function on valid input
cashflows <- c(-100, 20, 20, 20, 20, 20, 20, 10, 5)
calc_irr(cashflows)
## [1] 0.08296991
# Define profitability index function: calc_profitability_index
calc_profitability_index <- function(init_investment, future_cashflows, r) {
    discounted_future_cashflows <- calc_npv(future_cashflows, r)
    discounted_future_cashflows / abs(init_investment)
}

# Try out function on valid input
init_investment <- -100
cashflows <- c(0, 20, 20, 20, 20, 20, 20, 10, 5)
calc_profitability_index(init_investment, cashflows, 0.08)
## [1] 1.009938
# pull last year cashflow from vector of cashflows
last_year_cashflow <- cashflow[length(cashflow)]
last_period_n <- length(cashflow) - 1

# calculate terminal values for different growth rate assumptions
terminal_value_1 <- last_year_cashflow / ((0.15 - 0.1)*(1 + 0.15)^last_period_n)
terminal_value_2 <- last_year_cashflow / ((0.15 - 0.01)*(1 + 0.15)^last_period_n)
terminal_value_3 <- last_year_cashflow / ((0.15 + 0.05)*(1 + 0.15)^last_period_n)

# inspect results
terminal_value_1 
##       free_cf
## 1  -719183902
## 2  -392282129
## 3  -784564257
## 4 -1569128514
terminal_value_2
##      free_cf
## 1 -256851394
## 2 -140100760
## 3 -280201520
## 4 -560403041
terminal_value_3
##      free_cf
## 1 -179795976
## 2  -98070532
## 3 -196141064
## 4 -392282129
cashflow1 <- c(-50000, 100, 2000, 2000, 5000, 10000, 10000, 10000, 10000, 10000, 10000)
cashflow2 <- c(-1e+05, 20000, 20000, 20000, 20000, 20000)
cashflow3 <- c(-8000, 6000, 5000, 4000, 3000, 2000, 1000, 0)

# calculate internal rate of return (IRR) for each stream of cashflows
r1 <- calc_irr(cashflow1)
r2 <- calc_irr(cashflow2)
r3 <- calc_irr(cashflow3)

# calculate net present value (NPV) for each stream of cashflows, assuming r = irr
npv1 <- calc_npv(cashflow1, r1)
npv2 <- calc_npv(cashflow2, r2)
npv3 <- calc_npv(cashflow3, r3)

# examine results
npv1
## [1] -5.859804
npv2
## [1] 0
npv3
## [1] -0.1359058
cf1 <- c(-5000, 450, 450, 450, 450, 450, 450, 450, 450, 450, 450)
cf2 <- c(-5000, 2000, 2000, 2000, 2000, 2000, 2000, -2000, -2000, -2000, -2000)
rates <- seq(0, 0.25, by = 0.005)

# create dataset of NPV for each cashflow and rate
npv_by_rates <- data.frame(rates) %>%
    group_by(rates) %>%
    mutate(npv1 = calc_npv(cf1, rates), npv2 = calc_npv(cf2, rates))
   
# plot cashflows over different discount rates     
ggplot(npv_by_rates, aes(x = rates, y = npv1)) + 
    geom_line() +
    geom_line(aes(y = npv2)) +
    labs(title = "NPV by Discount Rate", subtitle = "A Tale of Two Troubling Cashflows",
         y = "NPV ($)",x = "Discount Rate (%)"
         ) +
    annotate("text", x = 0.2, y = -500, label = "Two break-even points") +
    annotate("text", x = 0.2, y = -2500, label = "No break-even point")
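The second cashflow (cf2) changes sign twice, so its NPV curve can cross zero at more than one rate, and a single uniroot() call only returns one root. A sketch of one way to recover every break-even rate (assuming all roots lie between 0 and 1; the grid step and helper name are my own, not from the course) is to scan for sign changes in NPV and bracket each crossing:

```r
# The NPV of cf2 crosses zero more than once; scan a grid for sign
# changes and bracket each crossing with uniroot()
calc_pv <- function(fv, r, n) fv / (1 + r)^n

calc_npv <- function(cashflows, r) {
  n <- 0:(length(cashflows) - 1)
  sum(calc_pv(cashflows, r, n))
}

cf2 <- c(-5000, 2000, 2000, 2000, 2000, 2000, 2000, -2000, -2000, -2000, -2000)

find_irrs <- function(cashflows, lower = 0, upper = 1, step = 0.01) {
  grid <- seq(lower, upper, by = step)
  npvs <- sapply(grid, function(r) calc_npv(cashflows, r))
  flips <- which(diff(sign(npvs)) != 0)   # NPV changes sign in these intervals
  sapply(flips, function(i)
    uniroot(calc_npv, interval = c(grid[i], grid[i + 1]),
            cashflows = cashflows, tol = 1e-10)$root)
}

find_irrs(cf2)   # returns every break-even discount rate found on the grid
```

The grid step trades speed against the risk of missing two roots inside one interval; 0.01 is fine for these slowly-varying NPV curves.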

cashflows <- data.frame(option=rep(1:4, each=11), time=rep(0:10, times=4), 
                        cashflow=c(-10, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, -1000, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, -1e+05, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, -10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
                        )

# calculate summary metrics
cashflow_comparison <-
  cashflows %>%
  group_by(option) %>%
  summarize(npv = calc_npv(cashflow, 0.1), irr = calc_irr(cashflow))
      
# inspect output
cashflow_comparison
## # A tibble: 4 x 3
##   option      npv   irr
##    <int>    <dbl> <dbl>
## 1      1    14.6  0.385
## 2      2   843.   0.273
## 3      3 22891.   0.151
## 4      4    -3.86 0
# visualize summary metrics
ggplot(cashflow_comparison, aes(x = npv, y = irr, col = factor(option))) +
    geom_point(size = 5) +
    geom_hline(yintercept = 0.1) +
    scale_y_continuous(label = scales::percent) +
    scale_x_continuous(label = scales::dollar) +
    labs(title = "NPV versus IRR for Project Alternatives",
         subtitle = "NPV calculation assumes 10% discount rate",
         caption = "Line shows actual discount rate to assess IRR break-even",
         x = "NPV ($)", y = "IRR (%)",col = "Option"
         )


Chapter 4 - Understanding Outcomes

Building a Business Case:

  • Can combine metrics to evaluate potential business models
  • Example of working for a data-driven coffee shop - move from coffee to nitro-brew
    • Consider only incremental costs; for example, existing rent and sunk costs (incurred prior to the project) should not be included
    • May want to consider side-effects such as self-cannibalization
  • Business modeling is part art and part science; need a good blend

Scenario Analysis with tidyr and purrr:

  • May want to consider additional projects, and there may not be funding for pursuing all of these ideas
  • May be some variability in the accuracy of our estimates - “what if analysis”, aka “scenario analysis”
    • scenario1 <- mutate(assumptions, var1 = 1.2 * var1)
    • cashflow1 <- calc_model(scenario1)
    • calc_npv(cashflow1)
  • Can use tidyr and purrr to automate some of this
    • all_scenarios %>% nest(-scenario)
    • all_scenarios %>% nest(-scenario) %>% mutate( cashflow = map( data, calc_model) ) # each data frame in column ‘data’ is passed to calc_model
    • all_scenarios %>% nest(-scenario) %>% mutate( cashflow = map( data, calc_model) ) %>% mutate( npv = map_dbl( cashflow, calc_npv) )

Sensitivity analysis:

  • Sensitivity analysis is an approach to uncertainty that assesses the impact of each base assumption on the outcomes
    • sensitivity <- expand.grid( factor = c(0.5, 1, 1.5), metric = c("vbl1", "vbl2") )
    • sensitivity <- expand.grid( factor = c(0.5, 1, 1.5), metric = c("vbl1", "vbl2") ) %>% mutate(scenario = map2(metric, factor, ~factor_data(assumptions, .x, .y))) # the ~ creates an anonymous function
  • Can then visualize the impact of the sensitivity analysis; change in overall NPV vs. % change in base metric
  • Caution that errors are often correlated - this is just a univariate sensitivity analysis
  • Caution that not all metrics make the same magnitudes of change; analysis could be misinterpreted as a result

Communicating Cashflow Concepts:

  • Long data is tidy - one column per metric and one row per observation (time)
  • Financial stakeholders will be more accustomed to non-tidy data
    • long_cashflow <- gather(cashflow, key = Month, value = Value, -Metric) # make the data long with key column Month and values in Value and Metric left as a column
    • tidy_cashflow <- spread(long_cashflow, key = Metric, value = Value) # spread back to one column per metric
  • Waterfall diagrams may be useful for communicating outcomes
    • ggplot(data) + geom_rect( aes( xmin = , xmax = , ymin = , ymax = ) )
    • ggplot(waterfall_data, aes( xmin = rn - 0.25, xmax = rn + 0.25, ymin = start, ymax = end) ) + geom_rect() + scale_x_continuous( breaks = waterfall_data$rn, labels = waterfall_data$category )

Advanced Topics in Cashflow Modeling:

  • Many additional topics can be covered in financial modeling
    • Debt vs. equity financing
    • Decisions made today change the options available to us in the future
    • Probabilistic simulation (impacts of uncertainty)
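The probabilistic simulation idea can be sketched with a minimal Monte Carlo run; the distributional assumptions below are mine, not the course's:

```r
# Draw uncertain future cashflows, compute an NPV for each draw, and
# summarize the resulting distribution
set.seed(42)

calc_npv <- function(cashflows, r) {
  n <- 0:(length(cashflows) - 1)
  sum(cashflows / (1 + r)^n)
}

simulate_npv <- function(n_sims = 1000, r = 0.1) {
  replicate(n_sims, {
    # initial investment known; future cashflows assumed normal
    cashflows <- c(-1000, rnorm(5, mean = 300, sd = 75))
    calc_npv(cashflows, r)
  })
}

npvs <- simulate_npv()
summary(npvs)
mean(npvs > 0)   # estimated probability the project is NPV-positive
```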

Example code includes:

assumptions <- data.frame(year=0:10, 
                          unit_sales_per_day=c(0, 10, 12, 14, 15, 16, 17, 18, 18, 18, 18),
                          capex=c(5000, rep(0, 10)),
                          pct_cannibalization=c(0, rep(0.25, 10)), 
                          maintenance_cost=c(0, rep(250, 10)), 
                          depreciation_cost=c(0, rep(500, 10)), 
                          profit_margin_per_nitro=3, 
                          profit_margin_per_regular=1, 
                          labor_cost_per_hour=8, 
                          days_open_per_year=250
                          )

# Check the first few rows of the data
head(assumptions)
##   year unit_sales_per_day capex pct_cannibalization maintenance_cost
## 1    0                  0  5000                0.00                0
## 2    1                 10     0                0.25              250
## 3    2                 12     0                0.25              250
## 4    3                 14     0                0.25              250
## 5    4                 15     0                0.25              250
## 6    5                 16     0                0.25              250
##   depreciation_cost profit_margin_per_nitro profit_margin_per_regular
## 1                 0                       3                         1
## 2               500                       3                         1
## 3               500                       3                         1
## 4               500                       3                         1
## 5               500                       3                         1
## 6               500                       3                         1
##   labor_cost_per_hour days_open_per_year
## 1                   8                250
## 2                   8                250
## 3                   8                250
## 4                   8                250
## 5                   8                250
## 6                   8                250
# Check the variable names of the data
names(assumptions)
##  [1] "year"                      "unit_sales_per_day"       
##  [3] "capex"                     "pct_cannibalization"      
##  [5] "maintenance_cost"          "depreciation_cost"        
##  [7] "profit_margin_per_nitro"   "profit_margin_per_regular"
##  [9] "labor_cost_per_hour"       "days_open_per_year"
# Plot the trend of unit_sales_per_day by year
ggplot(assumptions, aes(x = year, y = unit_sales_per_day)) + 
    geom_line()

tax_rate <- 0.36

# Create the cashflow_statement dataframe
cashflow_statement <-
  mutate(assumptions,
    sales_per_year = unit_sales_per_day * days_open_per_year,
    sales_revenue = sales_per_year * profit_margin_per_nitro,
    total_revenue = sales_revenue,
    labor_cost = days_open_per_year * 0.5 * labor_cost_per_hour,
    cannibalization_cost = sales_per_year * pct_cannibalization * profit_margin_per_regular,
    direct_expense = labor_cost + cannibalization_cost + maintenance_cost,
    gross_profit = total_revenue - direct_expense,
    operating_income = gross_profit - depreciation_cost,
    net_income = operating_income * (1 - tax_rate), 
    cashflow = net_income + depreciation_cost - capex    
  )


# build individual scenarios
optimist <- mutate(assumptions, unit_sales_per_day = unit_sales_per_day * 1.2, pct_cannibalization = 0.1)
pessimist <- mutate(assumptions, unit_sales_per_day = unit_sales_per_day * 0.8, profit_margin_per_nitro = 1)

# combine into one dataset
scenarios <-
  bind_rows(
    mutate(pessimist, scenario = "pessimist"),
    mutate(assumptions, scenario = "realist"),
    mutate(optimist, scenario = "optimist")
  )


calc_model <- function(assumptions){
  mutate( assumptions,
    sales_per_year = unit_sales_per_day * days_open_per_year,
    sales_revenue = sales_per_year * profit_margin_per_nitro,
    total_revenue = sales_revenue,
    labor_cost = days_open_per_year * 0.5 * labor_cost_per_hour,
    cannibalization_cost = sales_per_year * pct_cannibalization * profit_margin_per_regular,
    direct_expense = labor_cost + cannibalization_cost + maintenance_cost,
    gross_profit = total_revenue - direct_expense,
    operating_income = gross_profit - depreciation_cost,
    net_income = operating_income * (1 - 0.36), 
    cashflow = net_income + depreciation_cost - capex    
  )
}

calc_npv_from_cashflow <- function(cashflow, r){
  cashflow_line <- cashflow$cashflow
  sum(calc_pv(cashflow_line, r, 0:(length(cashflow_line)-1)))
}

# calculate scenario NPVs
scenario_analysis <- scenarios %>%
    nest(-scenario) %>%
    mutate(cashflow = map(data, calc_model)) %>%
    mutate(npv = map_dbl(cashflow, calc_npv_from_cashflow, 0.2))

# inspect results
select(scenario_analysis, scenario, npv)
## # A tibble: 3 x 2
##   scenario     npv
##   <chr>      <dbl>
## 1 pessimist -2505.
## 2 realist   18042.
## 3 optimist  25019.
# scenario analysis bar chart
ggplot(data = scenario_analysis, aes(x = scenario, y = npv, fill = scenario)) + 
    geom_bar(stat = "identity") +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "NPV Scenario Analysis of Nitro Coffee Expansion") +
    guides(fill = FALSE)

# define sensitivity factor function
factor_data <- function(data, metric, factor){
  data[[metric]] <- data[[metric]] * factor
  data
}

# create sensitivity analysis
sensitivity <-
  expand.grid(
    factor = seq(0.5,1.5,0.1), 
    metric = c("profit_margin_per_nitro", "labor_cost_per_hour", "pct_cannibalization", "unit_sales_per_day")) %>%
  mutate(scenario = map2(metric, factor, ~factor_data(assumptions, .x, .y))) %>%
  mutate(cashflow = map(scenario, calc_model)) %>% 
  mutate(npv = map_dbl(cashflow, calc_npv_from_cashflow, r = 0.2))


ggplot(sensitivity,
       aes(x = factor, y = npv, col = metric)
       ) +
  geom_line() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Sensitivity Analysis",
    x = "Factor on Original Assumption",
    y = "Projected NPV",
    col = "Metric"
  )

# examine current cashflow structure
tidy_cashflow <- data.frame(Month=1:6, 
                            Received=c(100, 200, 300, 400, 500, 500), 
                            Spent=c(150, 175, 200, 225, 250, 250)
                            )

# create long_cashflow with gather
# long_cashflow <- tidyr::gather(cashflow, key = Month, value = Value, -Metric)

# create tidy_cashflow with spread
# tidy_cashflow <- tidyr::spread(long_cashflow, key = Metric, value = Value)

# examine results
tidy_cashflow
##   Month Received Spent
## 1     1      100   150
## 2     2      200   175
## 3     3      300   200
## 4     4      400   225
## 5     5      500   250
## 6     6      500   250
# create long_cashflow with gather
long_cashflow <- tidyr::gather(tidy_cashflow, key = Metric, value = Value, -Month)

# create untidy_cashflow with spread
untidy_cashflow <- tidyr::spread(long_cashflow, key = Month, value = Value)

# examine results
untidy_cashflow
##     Metric   1   2   3   4   5   6
## 1 Received 100 200 300 400 500 500
## 2    Spent 150 175 200 225 250 250
gross_profit_summary <- data.frame(metric=c("Sales Revenue", "Keg Cost", "Cannibalization Cost", "Labor Cost", "Maintenance Cost"), 
                                   value=c(187200, -78240, -31200, -10000, -2500)
                                   )

# compute min and maxes for each line item
waterfall_items <-
  mutate(gross_profit_summary,
         end = cumsum(value), 
         start = lag(cumsum(value), 1, default = 0))

# compute totals row for waterfall metrics
waterfall_summary <- 
  data.frame(metric = "Gross Profit", 
             end = sum(gross_profit_summary$value), 
             start = 0)

# combine line items with summary row
waterfall_data <-
  bind_rows(waterfall_items, waterfall_summary) %>%
  mutate(row_num = row_number())
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
# Plot waterfall diagram
ggplot(waterfall_data, aes(fill = (end > start))) +
  geom_rect(aes(xmin = row_num - 0.25, xmax = row_num + 0.25, 
                ymin = start, ymax = end)) +
  geom_hline(yintercept = 0) +
  scale_x_continuous(breaks = waterfall_data$row_num, labels = waterfall_data$metric) +
  # Styling provided for you - check out a ggplot course for more information!
  scale_y_continuous(labels = scales::dollar) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank()) +
  guides(fill = FALSE) +
  labs(
      title = "Gross Profit for Proposed Nitro Coffee Expansion",
      subtitle = "Based on pro forma 10-year forecast")


Visualizing Big Data with Trelliscope

Chapter 1 - General Strategies for Visualizing Big Data

Visualizing summaries:

  • Start with summarization plots for large datasets
  • Univariate variable descriptions from the gapminder dataset
    • Continuous - histograms
    • ggplot(gapminder, aes(lifeExp)) + geom_histogram()
    • Discrete - bar chart for counts
    • ggplot(gapminder, aes(continent)) + geom_bar()
    • Temporal - bin by time, compute number of observations
    • by_year <- gapminder %>% group_by(year) %>% summarise(medianGdpPercap = median(gdpPercap, na.rm = TRUE))
    • ggplot(by_year, aes(year, medianGdpPercap)) + geom_line()
  • Binning is a common strategy for summarizing large datasets - particularly via the dplyr group_by() function
  • Can extend to the NYC taxi cab dataset - millions of records on the number of taxi trips taken in NYC

Adding more detail to summaries:

  • Can bin pairs of variables in two dimensions, and facet on additional variables
  • geom_hex() is a useful way to show the two-dimensional binned plot
    • ggplot(tx, aes(tip_amount, total_amount)) + geom_hex(bins = 75) + scale_x_log10() + scale_y_log10() + geom_abline(slope = 1, intercept = 0)
  • Can use the facet_wrap to see visual subsets of the data - all elements of the main call are added to each of the facets automatically
    • ggplot(daily_count, aes(pickup_date, n_rides)) + geom_point() + facet_grid(~ pickup_dow)

Visualizing subsets:

  • May need to look in more detail than what is available in the summaries
  • A useful technique is to take a natural subset of a large dataset; for example, all of the price data for a single stock
  • Example of examining the “zero tip” nature of the cash rides shown in the NYC taxi dataset; subset to 5,000 rides from UES South to UES North
    • ggplot(tx_pop, aes(trip_duration, total_amount)) + geom_point(alpha = 0.2)
    • ggplot(tx_pop, aes(sample = total_amount, color = payment_type)) + geom_qq(distribution = stats::qunif) + ylim(c(3, 20)) # quantile plot against the uniform distribution

Visualizing all subsets:

  • May want to extend the analysis to all of the routes in the taxi dataset
  • Faceting will not work with over 20k total panels as in the taxi dataset
  • Refined approaches using trelliscope help enable these higher-volume visualizations

Example code includes:

load("./RInputFiles/tx_sub.RData")
glimpse(tx)
## Observations: 1,000,000
## Variables: 7
## $ pick_day      <date> 2016-07-09, 2016-07-28, 2016-07-20, 2016-07-30,...
## $ pick_dow      <fct> Sat, Thu, Wed, Sat, Tue, Thu, Fri, Sun, Mon, Thu...
## $ total_amount  <dbl> 47.60, 9.96, 6.80, 11.75, 7.30, 12.05, 13.80, 14...
## $ tip_amount    <dbl> 23.80, 1.66, 1.00, 1.95, 0.00, 2.75, 0.00, 2.36,...
## $ payment_type  <fct> Card, Card, Card, Card, Cash, Card, Cash, Card, ...
## $ trip_duration <dbl> 26.116667, 5.866667, 4.916667, 10.350000, 6.8666...
## $ pick_wkday    <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALS...
tx <- tx %>%
    rename(pickup_date=pick_day, pickup_dow=pick_dow)
glimpse(tx)
## Observations: 1,000,000
## Variables: 7
## $ pickup_date   <date> 2016-07-09, 2016-07-28, 2016-07-20, 2016-07-30,...
## $ pickup_dow    <fct> Sat, Thu, Wed, Sat, Tue, Thu, Fri, Sun, Mon, Thu...
## $ total_amount  <dbl> 47.60, 9.96, 6.80, 11.75, 7.30, 12.05, 13.80, 14...
## $ tip_amount    <dbl> 23.80, 1.66, 1.00, 1.95, 0.00, 2.75, 0.00, 2.36,...
## $ payment_type  <fct> Card, Card, Card, Card, Cash, Card, Cash, Card, ...
## $ trip_duration <dbl> 26.116667, 5.866667, 4.916667, 10.350000, 6.8666...
## $ pick_wkday    <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALS...
# Summarize taxi ride count by pickup day
daily_count <- tx %>%
  group_by(pickup_date) %>%
  summarise(n_rides=n())

# Create a line plot
ggplot(daily_count, aes(x=pickup_date, y=n_rides)) +
  geom_line()

# Create a histogram of total_amount
ggplot(tx, aes(x=total_amount)) +
  geom_histogram() +
  scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 58 rows containing non-finite values (stat_bin).

# Create a bar chart of payment_type
ggplot(tx, aes(x=payment_type)) +
  geom_bar()

# Create a hexagon-binned plot of total_amount vs. trip_duration
ggplot(tx, aes(x=trip_duration, y=total_amount)) +
  geom_hex(bins=75) +
  scale_x_log10() +
  scale_y_log10()
## Warning in self$trans$transform(x): NaNs produced

## Warning in self$trans$transform(x): Transformation introduced infinite
## values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1058 rows containing non-finite values (stat_binhex).

# Summarize taxi rides count by payment type, pickup date, pickup day of week, and payment type
daily_count <- tx %>%
  filter(payment_type %in% c("Card", "Cash")) %>%
  group_by(payment_type, pickup_date, pickup_dow) %>%
  summarise(n_rides=n())

# Plot the data
ggplot(daily_count, aes(x=pickup_date, y=n_rides)) +
  geom_point() +
  facet_grid(payment_type ~ pickup_dow) +
  coord_fixed(ratio = 0.4)

# Histogram of the tip amount faceted on payment type
ggplot(tx, aes(x=tip_amount+0.01)) +
  geom_histogram() +
  scale_x_log10() +
  facet_wrap(~ payment_type, ncol=1, scales="free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Get data ready to plot
amount_compare <- tx %>%
    mutate(total_no_tip = total_amount - tip_amount) %>%
    filter(total_no_tip <= 20) %>%
    sample_n(size=round(nrow(.)/20), replace=FALSE) %>%
    select(total_amount, total_no_tip, payment_type) %>%
    tidyr::gather(amount_type, amount, -payment_type)

# Quantile plot
ggplot(amount_compare, aes(sample=amount, color=payment_type)) +
  geom_qq(distribution=stats::qunif, shape = 21) +
  facet_wrap(~ amount_type) +
  ylim(c(3, 20))
## Warning: Removed 2202 rows containing missing values (geom_point).


Chapter 2 - ggplot2 + Trelliscope JS

Faceting with Trelliscope JS:

  • The trelliscope is a powerful tool for viewing moderate and large datasets - will again explore gapminder as a starting point
    • ggplot(gapminder, aes(year, lifeExp, group = country)) + geom_line() # enormous over-plotting problem
    • ggplot(gapminder, aes(year, lifeExp, group = country, color = continent)) + geom_line() + facet_wrap(~ continent, nrow = 1) + guides(color = FALSE) # somewhat better to see insights by continent
    • ggplot(gapminder, aes(year, lifeExp)) + geom_line() + facet_wrap(~ country + continent) # labels overplot the data (impossible to view)
  • The trelliscope is just a replacement of a single call in the ggplot
    • It’s as easy as swapping out facet_wrap() for facet_trelliscope()
    • As with facet_wrap(), control rows and columns with nrow and ncol

Interacting with Trelliscope JS displays:

  • Can look at just one panel as per the JS user controls; focus in on the most interesting areas of a visualization
  • Can page through the panels using the Prev and Next panels and/or the arrow buttons
  • Can customize the grid layout by clicking on the “Grid” widget on the side of the plot
  • The “cognostics” are also clickable as far as which labels will be shown
    • Can subset by selecting portions of a histogram and/or categories for viewing
    • Default sorting order are low-to-high, though this can be customized

Additional Trelliscope JS features:

  • Can use plotly for additional interactivity within panels - adds tooltips and pop-ups
    • gap_life <- select(gapminder, year, lifeExp, country, continent)
    • ggplot(gap_life, aes(year, lifeExp)) + geom_point() + facet_trelliscope(~ country + continent, name = "lifeExp_by_country", desc = "Life expectancy vs. year for 142 countries.", nrow = 2, ncol = 3, as_plotly = TRUE)
  • Can calculate cognostics automatically for additional useful metrics
    • ggplot(gap_life, aes(year, lifeExp)) + geom_point() + facet_trelliscope(~ country + continent, name = "lifeExp_by_country", desc = "Life expectancy vs. year for 142 countries.", nrow = 2, ncol = 3, auto_cog = TRUE)
  • Multi-panel displays require good usage of axis limits
    • Similar to the scales argument in facet_wrap
    • Default behavior is scales = "same"
    • When scales = "sliced", the scales have the same ranges but with potentially different starting points (these are currently only in trelliscopejs rather than in ggplot2)
    • When scales = "free", the scales are independent by facet

Adding your own cognostics:

  • Can add new variables as cognostics by adding new variables to the data
  • Example of creating the latest observed life expectancy
    • gap <- gapminder %>% group_by(country) %>% mutate(latestLifeExp = tail(lifeExp, 1)) # do not summarize; maintain the original data
  • Can create links within the displays; for example, a column of URLs (such as the ihme_link created in the example code below) shown as a clickable label
  • Customizing custom cognostics
    • A function cog() can be wrapped around a variable to fine-tune how a cognostic is handled in Trelliscope
    • desc: a meaningful description for the cognostic
    • default_label: boolean specifying whether the cognostic should be shown as a label by default or not

Example code includes:

library(trelliscopejs)

data(gapminder, package="gapminder")
glimpse(gapminder)


# Create the plot
gapminder %>%
    filter(continent=="Oceania") %>%
    ggplot(aes(x=year, y=lifeExp)) +
    geom_line() +
    # Facet on country and continent
    facet_trelliscope(~ country + continent, data=gapminder)

gapminder %>%
    filter(continent=="Oceania") %>%
    ggplot(aes(x=year, y=lifeExp)) +
    geom_line() +
    facet_trelliscope(~ country + continent, name = "lifeExp_by_country",
                      desc = "Life expectancy vs. year per country", nrow = 1, ncol = 2
                      )


# Create the plot
gapminder %>%
    filter(continent=="Oceania") %>%
    ggplot(aes(x=year, y=lifeExp)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_trelliscope(~ country + continent, name = "lifeExp_by_country",
                      desc = "Life expectancy vs. year for 142 countries.",
                      nrow = 2, ncol = 3,
                      # Set the scales
                      scales="sliced",
                      # Specify automatic cognostics
                      auto_cog=TRUE
                      )


# Group by country and create the two new variables
gap <- gapminder %>%
    filter(continent=="Oceania") %>%
    group_by(country) %>%
    mutate(delta_lifeExp = tail(lifeExp, 1) - head(lifeExp, 1),
           ihme_link = paste0("http://www.healthdata.org/", country)
           )

# Add the description
gap$delta_lifeExp <- cog(gap$delta_lifeExp, desc = "Overall change in life expectancy")

# Specify the default label
gap$ihme_link <- cog(gap$ihme_link, default_label = TRUE)

ggplot(gap, aes(year, lifeExp)) +
    geom_point() +
    facet_trelliscope(~ country + continent,
                      name = "lifeExp_by_country",
                      desc = "Life expectancy vs. year for 142 countries.",
                      nrow = 2, ncol = 3, scales = c("same", "sliced")
                      )

Chapter 3 - Trelliscope in the Tidyverse

Trelliscope in the Tidyverse:

  • Stock market dataset called “stocks” has price data for the 500 highest-traded NASDAQ stocks
  • Can use ggplot2 to understand stock price changes by year
  • Might instead want to use the candlestick chart from plotly - can also zoom, pan, and hover based on the plotly characteristics
    • candlestick_plot <- function(d) plot_ly(d, x = ~date, type = "candlestick", open = ~open, close = ~close, high = ~high, low = ~low)
    • candlestick_plot(filter(stocks, symbol == "AAPL"))
  • Can use nesting to create a smaller data frame whose data column holds one tibble per group
    • by_symbol <- stocks %>% group_by(symbol) %>% nest()
    • by_symbol <- mutate(by_symbol, last_close = map_dbl(data, function(x) tail(x$close, 1)))
    • by_symbol <- mutate(by_symbol, panel = map_plot(data, candlestick_plot)) # map_plot is from the trelliscope package; one plot for each stock
    • trelliscope(by_symbol, name = "candlestick_top500", nrow = 2, ncol = 3) # show the plot with the given initial layout
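
The stocks dataset is not available in this document (as noted in the example code below), so the nest-then-map pattern above can be sketched on a made-up toy table; the symbols and prices here are invented for illustration:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for the course's 'stocks' data -- hypothetical values
stocks_toy <- tibble(
  symbol = rep(c("AAA", "BBB"), each = 3),
  close  = c(10, 11, 12, 20, 19, 18)
)

# Nest the per-symbol rows into a list-column, then map over it
by_symbol <- stocks_toy %>%
  group_by(symbol) %>%
  nest() %>%
  mutate(last_close = map_dbl(data, ~ tail(.x$close, 1)))

by_symbol$last_close  # last closing price per symbol
```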

Cognostics:

  • Can add variables, which will then be included as cognostics in the trelliscope
  • Can add metadata to the stocks database, then use nest to have the metadata as a new variable
    • by_symbol <- left_join(by_symbol, stocks_meta)
    • by_symbol <- mutate(by_symbol, volume_stats = map(data, function(x) { data_frame( min_volume = min(x$volume), max_volume = max(x$volume) ) }))

Trelliscope options:

  • Can further customize multiple options related to a trelliscope plot
  • Can specify the output directory for sharing the display with others - default is just a temporary directory, and will over-write anything in the directory with the same name
    • trelliscope(dat, path = "…", …)
    • ggplot(…) + … + facet_trelliscope(path = "…", …)
  • May want to provide more detailed descriptions of the data - shows in the viewer icon
    • trelliscope(by_symbol, name = "candlestick_top500", desc = "Candlestick plot of the 500 most-traded NASDAQ stocks in 2016", md_desc = " ## Candlestick Plot A candlestick plot is a financial plot… … ") # multi-line markdown in md_desc
  • Plot aspect ratio is important for strong visual displays - default is a square of 500px x 500px
    • trelliscope(dat, width = 600, height = 300, …) # specified in units of pixels
    • ggplot(…) + … + facet_trelliscope(width = 600, height = 300, …)
  • Default state can be further specified - interface is currently undergoing active improvement
    • trelliscope(dat, state = …, …)
    • ggplot(…) + … + facet_trelliscope(state = …, …)

Visualizing databases of images:

  • May want to view large collections of images, which is a strong use case for trelliscope - example of the pokemon dataset
    • select(pokemon, url_image)
    • pokemon <- mutate(pokemon, panel = img_panel(url_image))
  • May have local images for comparisons instead
    • path <- file.path(tempdir(), "pokemon_local")
    • dir.create(path)
    • for (url in pokemon$url_image) download.file(url, destfile = file.path(path, basename(url)))
    • pokemon$image <- basename(pokemon$url_image)
    • pokemon <- mutate(pokemon, panel = img_panel_local(image))
    • trelliscope(pokemon, name = "pokemon", nrow = 3, ncol = 6, path = path)

Example code includes:

# do not have dataset 'stocks'
by_symbol <- stocks %>%
  group_by(symbol) %>%
  nest()

min_volume_fn <- function(x) min(x$volume)

# Create new column
by_symbol_min <- by_symbol %>%
  mutate(min_volume = map_dbl(data, min_volume_fn))


ohlc_plot <- function(d) {
  plot_ly(d, x = ~date, type = "ohlc",
    open = ~open, close = ~close,
    high = ~high, low = ~low)
}

by_symbol_plot <- mutate(by_symbol, panel=map_plot(data, ohlc_plot))

trelliscope(by_symbol_plot, name="ohlc_top500")


# Create market_cap_log
by_symbol <- mutate(by_symbol,
  market_cap_log = cog(
    val = log10(market_cap),
    desc = "log base 10 market capitalization"
  )
)


annual_return <- function(x)
  100 * (tail(x$close, 1) - head(x$open, 1)) / head(x$open, 1)

# Compute by_symbol_avg
by_symbol_avg <- mutate(by_symbol,
  stats = map(data, function(x) {
    data_frame(
      mean_close = mean(x$close),
      mean_volume = mean(x$volume),
      annual_return = annual_return(x)
      )
    }
  )
)


# Create the trelliscope display
p <- trelliscope(by_symbol, width=600, height=300, name="ohlc_top500", desc="Example aspect 2 plot")


pokemon %>%
  # Reduce the variables in the dataset
  select(pokemon, type_1, attack, generation_id, url_image) %>%
  mutate(
    # Respecify pokemon
    pokemon = cog(pokemon, default_label=TRUE),
    # Create panel variable
    panel = img_panel(url_image)
  ) %>%
  # Create the display
  trelliscope(name="pokemon", nrow=3, ncol=6)

Chapter 4 - Case Study: Exploring Montreal BIXI Bike Data

Montreal BIXI Bike Data:

  • Bike riding data for 2017 for BIXI Montreal - the goal is to understand usage patterns
  • Aggregations and summary plots for the big picture, followed by deep-dives with trelliscope
  • Randomly sub-sampled down to 1,000,000 records for the examples

Summary Visualization Recap:

  • Clear differences in weekday usage (peaks at 08h00 and 17h00) and weekend usage (peak around 15h00)
  • Non-members ride roughly as much in total over the weekend (2 days) as over the workweek (5 days)
    • Hypothesis that members are more likely to be commuters and non-members more likely to be tourists

Top 100 routes dataset:

  • May want to examine route-specific behaviors
  • Considering a route as a from-to, we want to look at the top-100 routes, ignoring round-trip routes
    • route_tab <- bike %>% filter(start_station_code != end_station_code) %>% group_by(start_station_code, end_station_code) %>% summarise(n = n()) %>% arrange(-n)
    • top_routes <- paste(route_tab$start_station_code[1:100], route_tab$end_station_code[1:100])
    • top100 <- bike %>% filter(paste(start_station_code, end_station_code) %in% top_routes)
  • Preparing data for visualization
    • route_hod <- top100 %>% group_by(start_station_code, end_station_code, start_hod, weekday) %>% summarise(n = n())
    • route_hod <- route_hod %>% left_join(start_stations) %>% left_join(end_stations)
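
The top-routes selection above can be illustrated on a toy table; the station codes and ride counts here are made up, and for brevity only the single top route is kept rather than the top 100:

```r
library(dplyr)

# Toy stand-in for the BIXI 'bike' data -- hypothetical station codes
bike_toy <- tibble(
  start_station_code = c(1, 1, 1, 2, 2, 3),
  end_station_code   = c(2, 2, 2, 3, 3, 3)
)

# Rank directed routes by ride count, dropping round trips
route_tab <- bike_toy %>%
  filter(start_station_code != end_station_code) %>%
  count(start_station_code, end_station_code, sort = TRUE)

# Keep only rides on the top route, matching on a pasted from-to key
top_routes <- paste(route_tab$start_station_code[1],
                    route_tab$end_station_code[1])
top_rides <- bike_toy %>%
  filter(paste(start_station_code, end_station_code) %in% top_routes)

nrow(top_rides)  # 3 rides on the 1 -> 2 route
```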

Wrap up:

  • Can filter and sort to find key information in the trelliscope facets
  • Can further investigate routes with rush hour peaks (AM, PM, or both) as well as routes with differences in weekend and weekday

Example code includes:

# DO NOT HAVE FULL BIXI Data
bike04 <- read_csv("./RInputFiles/BIXIData/OD_2017-04.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike05 <- read_csv("./RInputFiles/BIXIData/OD_2017-05.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike06 <- read_csv("./RInputFiles/BIXIData/OD_2017-06.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike07 <- read_csv("./RInputFiles/BIXIData/OD_2017-07.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike08 <- read_csv("./RInputFiles/BIXIData/OD_2017-08.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike09 <- read_csv("./RInputFiles/BIXIData/OD_2017-09.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike10 <- read_csv("./RInputFiles/BIXIData/OD_2017-10.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
bike11 <- read_csv("./RInputFiles/BIXIData/OD_2017-11.csv")
## Parsed with column specification:
## cols(
##   start_date = col_datetime(format = ""),
##   start_station_code = col_integer(),
##   end_date = col_datetime(format = ""),
##   end_station_code = col_integer(),
##   duration_sec = col_integer(),
##   is_member = col_integer()
## )
stations <- read_csv("./RInputFiles/BIXIData/Stations_2017.csv")
## Parsed with column specification:
## cols(
##   code = col_integer(),
##   name = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
bike <- rbind(bike04, bike05, bike06, bike07, bike08, bike09, bike10, bike11) %>%
    mutate(membership=factor(is_member, levels=c(1, 0), labels=c("member", "non-member")), 
           start_day=as.Date(start_date), 
           start_dow=factor(lubridate::wday(start_date), levels=1:7, labels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
           weekday=factor(ifelse(start_dow %in% c("Sat", "Sun"), 2, 1), levels=1:2, labels=c("workweek", "weekend")), 
           start_hod=lubridate::hour(start_date), 
           start_mon=lubridate::month(start_date), 
           start_wk=lubridate::week(start_date)
           )
glimpse(bike)
## Observations: 4,740,357
## Variables: 13
## $ start_date         <dttm> 2017-04-15 00:00:00, 2017-04-15 00:01:00, ...
## $ start_station_code <int> 7060, 6173, 6203, 6104, 6174, 6719, 6223, 6...
## $ end_date           <dttm> 2017-04-15 00:31:00, 2017-04-15 00:10:00, ...
## $ end_station_code   <int> 7060, 6173, 6204, 6114, 6174, 6354, 6148, 6...
## $ duration_sec       <int> 1841, 553, 195, 285, 569, 620, 679, 311, 21...
## $ is_member          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ membership         <fct> member, member, member, member, member, mem...
## $ start_day          <date> 2017-04-15, 2017-04-15, 2017-04-15, 2017-0...
## $ start_dow          <fct> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat...
## $ weekday            <fct> weekend, weekend, weekend, weekend, weekend...
## $ start_hod          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ start_mon          <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ start_wk           <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,...
# Compute daily counts
daily <- bike %>%
  group_by(start_day, weekday) %>%
  summarise(n = n())

# Plot the result
ggplot(daily, aes(x=start_day, y=n, color=weekday)) +
  geom_point()

# Compute week_hod
week_hod <- bike %>%
  group_by(start_wk, start_hod, weekday) %>%
  summarize(n=n())

# Plot the result
ggplot(week_hod, aes(x=start_wk, y=n, color=weekday)) +
  geom_point() +
  facet_grid(cols=vars(start_hod)) +
  scale_y_sqrt()

# Compute wk_memb_hod
wk_memb_hod <- bike %>%
  group_by(start_wk, start_hod, weekday, membership) %>%
  summarize(n=n())

# Plot the result
ggplot(wk_memb_hod, aes(x=start_wk, y=n, color=weekday)) +
  geom_point() +
  facet_grid(membership ~ start_hod) +
  scale_y_sqrt()

# Compute daily_may
daily_may <- bike %>%
  filter(start_mon == 5) %>%
  group_by(start_day, start_hod, membership) %>%
  summarise(n = n())

# Plot the result
ggplot(daily_may, aes(x=start_hod, y=n, color=membership)) +
  geom_point() + 
  facet_wrap(~ start_day, ncol=7)

# ggplot(daily_may, aes(x=start_hod, y=n, color = membership)) +
#   geom_point() +
  # Facet on start_day
#   facet_trelliscope(~ start_day, nrow=2, ncol=7)


# Function to construct a Google maps URL with cycling directions
make_gmap_url <- function(start_lat, start_lon, end_lat, end_lon) {
  paste0("https://www.google.com/maps/dir/?api=1",
    "&origin=", start_lat, ",", start_lon,
    "&destination=", end_lat, ",", end_lon,
    "&travelmode=bicycling")
}

load("./RInputFiles/route_hod.RData")
glimpse(route_hod)
## Observations: 4,114
## Variables: 11
## $ start_station_code <int> 6012, 6012, 6012, 6012, 6012, 6012, 6012, 6...
## $ end_station_code   <int> 6015, 6015, 6015, 6015, 6015, 6015, 6015, 6...
## $ start_hod          <dbl> 0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9...
## $ weekday            <fct> workweek, weekend, workweek, weekend, workw...
## $ n                  <int> 12, 13, 11, 2, 2, 6, 3, 3, 1, 1, 2, 18, 1, ...
## $ start_station_name <chr> "Métro St-Laurent (de Maisonneuve / St-Laur...
## $ start_lat          <dbl> 45.51066, 45.51066, 45.51066, 45.51066, 45....
## $ start_lon          <dbl> -73.56497, -73.56497, -73.56497, -73.56497,...
## $ end_station_name   <chr> "Berri / de Maisonneuve", "Berri / de Maiso...
## $ end_lat            <dbl> 45.5153, 45.5153, 45.5153, 45.5153, 45.5153...
## $ end_lon            <dbl> -73.56127, -73.56127, -73.56127, -73.56127,...
# Compute tot_rides, weekday_diff, and map_url
route_hod_updated <- route_hod %>%
  group_by(start_station_code, end_station_code) %>%
  mutate(
    tot_rides = sum(n),
    weekday_diff = mean(n[weekday == "workweek"]) - mean(n[weekday == "weekend"]),
    map_url = make_gmap_url(start_lat, start_lon, end_lat, end_lon))


# Create the plot
# ggplot(route_hod, aes(x=start_hod, y=n, color=weekday)) +
  # geom_point(size=3) + 
  # facet_trelliscope(~start_station_name + end_station_name, nrow=2, ncol=4) + 
  # theme(legend.position = "none")

Visualization Best Practices in R

Chapter 1 - Proportions of a Whole

Course/Grammar of Graphics Information:

  • General objective is to think deeply about the data and make the best visualization based on the data at hand
  • Will cover standard visualization techniques and alternative visualization techniques for improvement
    • Topics are not cut and dried, rules will have exceptions, and the emphasis should be on thinking through the specific issue at hand
  • Dataset will be available from WHO - measles, mumps, etc., for 7 diseases
    • who_disease
    • amr_region <- who_disease %>% filter(region == 'AMR')
    • ggplot(amr_region, aes(x = year, y = cases, color = disease)) + geom_point(alpha = 0.5)

Pie Chart and Friends:

  • May want to examine the proportions of a single population - adds up to 100%
  • The pie chart is often one of the first techniques learned, but it is often abused (too many slices, 3D effects, slices not adding up to 100%, and the like)
    • Data encoded in angles, but humans are better at comparing lengths
    • After three slices, it typically becomes hard to compare (not just angles, but offset angles)
    • who_disease %>% mutate( region = ifelse( region %in% c('EUR', 'AFR'), region, 'Other') ) %>% ggplot(aes(x = 1, fill = region)) + geom_bar(color = 'white') + coord_polar(theta = "y") + theme_void()
  • The waffle chart can be more precise; encoding data by area (add the squares) rather than by angle
    • obs_by_region <- who_disease %>% group_by(region) %>% summarise(num_obs = n()) %>% mutate(percent = round(num_obs/sum(num_obs)*100))
    • percent_by_region <- obs_by_region$percent
    • names(percent_by_region) <- obs_by_region$region
    • waffle::waffle(percent_by_region, rows = 5)

When to use Bars:

  • May want to compare multiple proportions to each other - facets are not ideal for pie or waffle charts
  • Can instead use stacked bar charts with a common y-axis
    • who_disease %>% filter(region == 'SEAR') %>% ggplot(aes(x = countryCode, y = cases, fill = disease)) + geom_col(position = 'fill') # position="fill" makes it a proportion chart
  • Stacked bars tend to work poorly in isolation (only the first and last class have a good anchor)
  • Generalized best practices include 1) do not make a stacked bar chart in isolation, and 2) keep the number of groups reasonably small

Example code includes:

who_disease <- readr::read_csv("./RInputFiles/who_disease.csv")
## Parsed with column specification:
## cols(
##   region = col_character(),
##   countryCode = col_character(),
##   country = col_character(),
##   disease = col_character(),
##   year = col_double(),
##   cases = col_double()
## )
glimpse(who_disease)
## Observations: 43,262
## Variables: 6
## $ region      <chr> "EMR", "EUR", "AFR", "EUR", "AFR", "AMR", "AMR", "...
## $ countryCode <chr> "AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "...
## $ country     <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "A...
## $ disease     <chr> "measles", "measles", "measles", "measles", "measl...
## $ year        <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20...
## $ cases       <dbl> 638, 17, 41, 0, 53, 0, 0, 2, 99, 27, 0, 0, 0, 972,...
# set x aesthetic to region column
ggplot(who_disease, aes(x=region)) +
  geom_bar()

# filter data to AMR region. 
amr_region <- who_disease %>%
    filter(region=="AMR")

# map x to year and y to cases. 
ggplot(amr_region, aes(x=year, y=cases)) + 
    # lower alpha to 0.5 to see overlap.   
    geom_point(alpha=0.5)

# Wrangle data into form we want. 
disease_counts <- who_disease %>%
    mutate(disease = ifelse(disease %in% c('measles', 'mumps'), disease, 'other')) %>%
    group_by(disease) %>%
    summarise(total_cases = sum(cases))


ggplot(disease_counts, aes(x = 1, y = total_cases, fill = disease)) +
    # Use a column geometry.
    geom_col() + 
    # Change coordinate system to polar and set theta to 'y'.
    coord_polar(theta="y")

ggplot(disease_counts, aes(x = 1, y = total_cases, fill = disease)) +
    # Use a column geometry.
    geom_col() +
    # Change coordinate system to polar.
    coord_polar(theta = "y") +
    # Clean up the background with theme_void and give it a proper title with ggtitle.
    theme_void() +
    ggtitle('Proportion of diseases')

disease_counts <- who_disease %>%
    group_by(disease) %>%
    summarise(total_cases = sum(cases)) %>% 
    mutate(percent = round(total_cases/sum(total_cases)*100))

# Create an array of rounded percentages for diseases.
case_counts <- disease_counts$percent
# Name the percentage array with disease_counts$disease
names(case_counts) <- disease_counts$disease

# Pass case_counts vector to the waffle function to plot
waffle::waffle(case_counts)

disease_counts <- who_disease %>%
    mutate(disease = ifelse(disease %in% c('measles', 'mumps'), disease, 'other')) %>%
    group_by(disease, year) %>% # note the addition of year to the grouping.
    summarise(total_cases = sum(cases))

# add the mapping of year to the x axis. 
ggplot(disease_counts, aes(x=year, y = total_cases, fill = disease)) +
    # Change the position argument to make bars full height
    geom_col(position="fill")

disease_counts <- who_disease %>%
    mutate(
        disease = ifelse(disease %in% c('measles', 'mumps'), disease, 'other') %>%
            factor(levels=c('measles', 'other', 'mumps')) # change factor levels to desired ordering
        ) %>%
    group_by(disease, year) %>%
    summarise(total_cases = sum(cases))

# plot
ggplot(disease_counts, aes(x = year, y = total_cases, fill = disease)) +
    geom_col(position = 'fill')

disease_counts <- who_disease %>%
    # Filter to 1999 or later
    filter(year >= 1999) %>% 
    mutate(disease = ifelse(disease %in% c('measles', 'mumps'), disease, 'other')) %>%
    group_by(disease, region) %>%    # Add region column to grouping
    summarise(total_cases = sum(cases))

# Set aesthetics so disease is the stacking variable, region is the x-axis and counts are the y
ggplot(disease_counts, aes(x=region, y=total_cases, fill=disease)) +
    # Add a column geometry with the proper position value.
    geom_col(position="fill")


Chapter 2 - Point Data

Point Data:

  • Switching from proportions to points (numerical observations for multiple categories) - single observations per item
  • Most common visualization is a bar chart
    • ggplot(who_disease) + geom_col(aes(x = disease, y = cases))
  • Bar charts are frequently used incorrectly for the task at hand; really best for accumulating measures (money can be thought of as a stack of coins)
    • However, if the bars are something non-stackable (such as likelihood of error), then bars are not appropriate
    • Humans tend to perceive bars as meaning “everything inside the bar is included, and everything outside the bar is excluded”

Point Charts:

  • When a point is not really a stackable quantity (percentile, temperature, log-transform, etc.), alternative to bar charts should be used
  • Instead, a point right at the top of the bar may work best; also known as a dot chart; high precision and simple
    • who_subset %>% ggplot(aes(y = country, x = log10(cases_2016))) + geom_point()
  • Can reorder the point charts, applied directly in the aesthetic
    • who_subset %>% mutate(logFoldChange = log2(cases_2016/cases_2006)) %>% ggplot(aes(x = logFoldChange, y = reorder(country, logFoldChange))) + geom_point()

Tuning Charts:

  • Can make the default charts more efficient and attractive
    • busy_bars <- who_disease %>% filter(region == 'EMR', disease == 'measles', year == 2015) %>% ggplot(aes(x = country, y = cases)) + geom_col() # Base plot
  • Flipping bar charts can make labels easier to read
    • busy_bars + coord_flip() # swap x and y axes!
  • Can also get rid of excess grid lines along the categorical axis
    • plot + theme(panel.grid.major.x = element_blank()) # get rid of vertical grid lines
  • Lighter backgrounds can make for better contrasts in a point chart
    • who_subset %>% ggplot(aes(y = reorder(country, cases_2016), x = log10(cases_2016))) + geom_point(size = 2) + theme_minimal()

Example code includes:

who_disease %>% 
    # filter to India in 1980
    filter(country=="India", year==1980) %>% 
    # map x aesthetic to disease and y to cases
    ggplot(aes(x=disease, y=cases)) +
    # use geom_col to draw
    geom_col()

who_disease %>%
    # filter data to observations of greater than 1,000 cases
    filter(cases > 1000) %>%
    # map the x-axis to the region column
    ggplot(aes(x=region)) +
    # add a geom_bar call
    geom_bar()

interestingCountries <- c('NGA', 'SDN', 'FRA', 'NPL', 'MYS', 'TZA', 'YEM', 'UKR', 'BGD', 'VNM')
who_subset <- who_disease %>% 
    filter(countryCode %in% interestingCountries, disease == 'measles', year %in% c(1992, 2002)) %>% 
    mutate(year = paste0('cases_', year)) %>%
    spread(year, cases)

 
# Reorder y axis and change the cases year to 1992
ggplot(who_subset, aes(x = log10(cases_1992), y = reorder(country, cases_1992))) +
    geom_point()

who_subset %>% 
    # calculate the log fold change between 2002 and 1992
    mutate(logFoldChange = log2(cases_2002/cases_1992)) %>% 
    # set y axis as country ordered with respect to logFoldChange
    ggplot(aes(x = logFoldChange, y = reorder(country, logFoldChange))) +
    geom_point() +
    # add a visual anchor at x = 0
    geom_vline(xintercept=0)

who_subset %>% 
    mutate(logFoldChange = log2(cases_2002/cases_1992)) %>% 
    ggplot(aes(x = logFoldChange, y = reorder(country, logFoldChange))) +
    geom_point() +
    geom_vline(xintercept = 0) +
    xlim(-6,6) +
    # add facet_grid arranged in the column direction by region and free_y scales
    facet_grid(region ~ ., scale="free_y")

amr_pertussis <- who_disease %>% 
    filter(region == 'AMR', year == 1980, disease == 'pertussis')


# Set x axis as country ordered with respect to cases. 
ggplot(amr_pertussis, aes(x = reorder(country, cases), y = cases)) +
    geom_col() +
    # flip axes
    coord_flip()

amr_pertussis %>% 
    # filter to countries that had > 0 cases. 
    filter(cases > 0) %>%
    ggplot(aes(x = reorder(country, cases), y = cases)) +
    geom_col() +
    coord_flip() +
    theme(panel.grid.major.y = element_blank())

amr_pertussis %>% filter(cases > 0) %>% 
    ggplot(aes(x = reorder(country, cases), y = cases)) + 
    # switch geometry to points and set point size = 2
    geom_point(size=2) + 
    # change y-axis to log10. 
    scale_y_log10() +
    # add theme_minimal()
    theme_minimal() +
    coord_flip()


Chapter 3 - Single Distributions

Importance of Distributions:

  • Distribution data includes multiple observations from the same population
  • Plotting distribution is valuable in many ways
    • Identifying errors or outliers
    • Identifying multiple peaks; potential segmenting variable
    • Distributions are more accurate and truthful to the data than a summary statistic
  • Histograms are a common approach to showing distributions
  • Box plots are a common approach for comparing multiple distributions
  • Will examine the md_speeding dataset from Montgomery County, MD
    • md_speeding %>% filter(vehicle_color == 'BLUE') %>% ggplot(aes(x = speed)) + geom_histogram()

Histogram Nuances:

  • Histograms have the advantage of being intuitive to readers and easier to interpret; higher bars mean more frequent
  • Histograms need care in setting the number of bins for better interpretation; binning can significantly change the plot
    • If you have 150+ points, set the bins to 100 and call it a day

Kernel Density Estimates:

  • The KDE (kernel density estimator) is a common approach to managing the binning and other challenges associated with histograms
    • Kernel function is created on top of every point, then summed across the distribution
    • Typically, a Gaussian kernel is used, though there are other options
    • sample_n(md_speeding, 100) %>% ggplot(aes(x = percentage_over_limit)) + geom_density(fill = 'steelblue', bw = 8)
  • For the KDE, need to choose the width of the kernel, which is controlled by the bw parameter (analogous to sd for Gaussians)
  • Can also show all of the data using the rug plot
    • p <- sample_n(md_speeding, 100) %>% ggplot(aes(x = percentage_over_limit)) + geom_density(fill = 'steelblue', bw = 8) # fill the curve with color; bw is the standard deviation of the kernel
    • p + geom_rug(alpha = 0.4)

Example code includes:

colKeep <- c('work_zone', 'vehicle_type', 'vehicle_year', 'vehicle_color', 'race', 'gender', 
             'driver_state', 'speed_limit', 'speed', 'day_of_week', 'day_of_month', 'month', 
             'hour_of_day', 'speed_over', 'percentage_over_limit'
             )
colRead <- c("Work Zone", "VehicleType", "Year", "Color", "Race", "Gender", 
             "DL State", "Description", "Date Of Stop", "Time Of Stop")
regFind <- ".*EXCEEDING MAXIMUM SPEED: ([0-9]+) MPH .* POSTED ([0-9]+) MPH .*"

md_speeding <- readr::read_csv("./RInputFiles/MD_Traffic/Traffic_violations.csv", n_max=200000) %>%
    select(colRead) %>%
    filter(grepl("EXCEEDING MAXIMUM SPEED: ", Description)) %>%
    rename(work_zone="Work Zone", vehicle_type=VehicleType, vehicle_year=Year, 
           vehicle_color=Color, race=Race, gender=Gender, driver_state="DL State", 
           stopDate="Date Of Stop", stopTime="Time Of Stop") %>%
    mutate(speed_limit=as.integer(gsub(regFind, "\\2", Description)), 
           speed=as.integer(gsub(regFind, "\\1", Description)), 
           speed_over=speed - speed_limit, 
           percentage_over_limit=100 * speed_over / speed_limit, 
           stopDate=as.Date(stopDate, format="%m/%d/%Y"), 
           day_of_week=lubridate::wday(stopDate), 
           day_of_month=lubridate::day(stopDate), 
           month=lubridate::month(stopDate), 
           hour_of_day=lubridate::hour(stopTime)
           )
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Time Of Stop` = col_time(format = ""),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   Year = col_integer()
## )
## See spec(...) for full column specifications.
# Print data to console
glimpse(md_speeding)
## Observations: 11,935
## Variables: 18
## $ work_zone             <chr> "No", "No", "No", "No", "No", "No", "No"...
## $ vehicle_type          <chr> "02 - Automobile", "08 - Recreational Ve...
## $ vehicle_year          <int> 2004, 2012, 1999, 2007, 2008, 2013, 2006...
## $ vehicle_color         <chr> "GOLD", "GRAY", "BLACK", "BLACK", "BLACK...
## $ race                  <chr> "OTHER", "WHITE", "OTHER", "HISPANIC", "...
## $ gender                <chr> "M", "F", "M", "F", "F", "M", "M", "M", ...
## $ driver_state          <chr> "MD", "MD", "MD", "MD", "MD", "MD", "PA"...
## $ Description           <chr> "EXCEEDING MAXIMUM SPEED: 49 MPH IN A PO...
## $ stopDate              <date> 2013-03-20, 2013-05-27, 2012-07-26, 201...
## $ stopTime              <time> 08:53:00, 14:13:00, 09:24:00, 10:40:00,...
## $ speed_limit           <int> 40, 55, 35, 55, 55, 40, 55, 40, 40, 45, ...
## $ speed                 <int> 49, 80, 44, 92, 64, 49, 74, 49, 49, 68, ...
## $ speed_over            <int> 9, 25, 9, 37, 9, 9, 19, 9, 9, 23, 19, 16...
## $ percentage_over_limit <dbl> 22.50000, 45.45455, 25.71429, 67.27273, ...
## $ day_of_week           <dbl> 4, 2, 5, 7, 4, 3, 5, 1, 6, 7, 5, 1, 2, 6...
## $ day_of_month          <int> 20, 27, 26, 13, 31, 1, 26, 1, 12, 26, 11...
## $ month                 <dbl> 3, 5, 7, 7, 7, 10, 2, 9, 7, 10, 9, 5, 3,...
## $ hour_of_day           <int> 8, 14, 9, 10, 13, 17, 20, 8, 19, 16, 10,...
# Change filter to red cars
md_speeding %>% 
    filter(vehicle_color == 'RED') %>% 
    # switch x mapping to speed_over column
    ggplot(aes(x = speed_over)) +
    geom_histogram() +
    # give plot a title
    ggtitle('MPH over speed limit | Red cars')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(md_speeding) + 
    # Add the histogram geometry with x mapped to speed_over
    geom_histogram(aes(x=speed_over), alpha=0.7) +
    # Add minimal theme
    theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(md_speeding) +
    geom_histogram(aes(x=hour_of_day, y=stat(density)), alpha=0.8)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Load md_speeding into ggplot
ggplot(md_speeding) +
  # add a geom_histogram with x mapped to percentage_over_limit
    geom_histogram(
        aes(x=percentage_over_limit), 
        bins=40,     # set bin number to 40
        alpha=0.8)    # reduce alpha to 0.8

ggplot(md_speeding) +
    geom_histogram(
        aes(x = percentage_over_limit),
        bins = 100 ,         # switch to 100 bins
        fill="steelblue",   # set the fill of the bars to 'steelblue'
        alpha = 0.8 )

ggplot(md_speeding, aes(x = hour_of_day)) +
    geom_histogram(
        binwidth=1,  # set binwidth to 1
        center=0.5   # Center bins at the half (0.5) hour
    ) +
    scale_x_continuous(breaks = 0:24)

# filter data to just heavy duty trucks
truck_speeding <- md_speeding %>% 
    filter(vehicle_type == "06 - Heavy Duty Truck")
 
ggplot(truck_speeding, aes(x = hour_of_day)) +
    # switch to density with bin width of 1.5, keep fill 
    geom_density(fill = 'steelblue', bw=1.5) +
    # add a subtitle stating binwidth
    labs(title = 'Citations by hour', subtitle="Gaussian kernel SD = 1.5")

ggplot(truck_speeding, aes(x = hour_of_day)) +
    # Adjust opacity to see gridlines with alpha = 0.7
    geom_density(bw = 1.5, fill = 'steelblue', alpha=0.7) +
    # add a rug plot using geom_rug to see individual datapoints, set alpha to 0.5.
    geom_rug(alpha=0.5) +
    labs(title = 'Citations by hour', subtitle = "Gaussian kernel SD = 1.5")

ggplot(md_speeding, aes(x = percentage_over_limit)) +
    # Increase bin width to 2.5
    geom_density(fill = 'steelblue', bw = 2.5,  alpha = 0.7) + 
    # lower rugplot alpha to 0.05
    geom_rug(alpha = 0.05) + 
    labs(
        title = 'Distribution of % over speed limit', 
        # modify subtitle to reflect change in kernel width
        subtitle = "Gaussian kernel SD = 2.5"
    )


Chapter 4 - Comparing Distributions

Introduction to Comparing Distributions:

  • Box plots can be helpful for comparing across key covariates, especially since multiple histograms are space-inefficient
    • Include median, IQR, 1.5*IQR, Outliers
    • Can be helpful to see the actual data rather than just the boxplot
    • md_speeding %>% filter(vehicle_color == 'BLUE') %>% ggplot(aes(x = gender, y = speed)) + geom_jitter(alpha = 0.3, color = 'steelblue') + geom_boxplot(alpha = 0) + labs(title = 'Distribution of speed for blue cars by gender') # alpha = 0 makes the boxplot transparent

Bee Swarms and Violins:

  • Jittering is helpful to see quantities of data, but bee swarms or violin plots can help summarize data
  • The bee swarm is a smart version of a jitter plot, with denser points stacking in an area (shape shows quantity of data)
    • library(ggbeeswarm)
    • ggplot(data, aes(y = y, x = group)) + geom_beeswarm(color = 'steelblue')
  • Violin plots are KDE that are symmetric around the categorical axis
    • ggplot(data, aes(y = y, x = group)) + geom_violin(fill = 'steelblue')

Comparing Spatially Related Distributions:

  • Bee swarm and violin plots are not ideal when data are spatially connected (ordering to the categorical axis, such as longitudinal time series data)
  • Ridge-line plots can be great for seeing how the KDE evolves over time; good for seeing individual KDE as well as evolutions over time
    • library(ggridges) # gives us geom_density_ridges()
    • ggplot(md_speeding, aes(x=speed_over, y=month)) + geom_density_ridges(bandwidth = 2) + xlim(1, 35)

Wrap Up:

  • Subtle things can take a visualization from good to great
  • Proportions - pie charts can be OK, as can waffle charts, stacked bars, etc.
  • Point Data - bar charts are good for stackable data, while points are good for non-stackable data
    • Remove grid lines where needed and order by value unless the categories have an inherent ordering
  • Single distributions - histograms, KDE, rug charts, etc.
  • Comparing distributions - box plots, jitter plots, violin plots, beeswarm plots, etc.
  • Further exploration can include
    • Flowing data blog
    • Datawrapper blog
    • Twitter #datavis
    • “Data Visualization” by Andy Kirk, “Functional Art and Truthful Art” by Alberto Cairo

Example code includes:

md_speeding %>% 
    filter(vehicle_color == 'RED') %>%
    # Map x and y to gender and speed columns respectively
    ggplot(aes(x=gender, y=speed)) + 
    # add a boxplot geometry
    geom_boxplot() +
    # give plot supplied title
    labs(title = 'Speed of red cars by gender of driver')

md_speeding %>% 
    filter(vehicle_color == 'RED') %>%
    ggplot(aes(x = gender, y = speed)) + 
    # add jittered points with alpha of 0.3 and color 'steelblue'
    geom_jitter(alpha=0.3, color="steelblue") +
    # make boxplot transparent with alpha = 0
    geom_boxplot(alpha=0) +
    labs(title = 'Speed of red cars by gender of driver')

# remove color filter
md_speeding %>%
    ggplot(aes(x = gender, y = speed)) + 
    geom_jitter(alpha = 0.3, color = 'steelblue') +
    geom_boxplot(alpha = 0) +
    # add a facet_wrap by vehicle_color
    facet_wrap(~ vehicle_color) +
    # change title to reflect new faceting
    labs(title = 'Speed of different car colors, separated by gender of driver')

md_speeding %>% 
    filter(vehicle_color == 'RED') %>%
    ggplot(aes(x = gender, y = speed)) + 
    # change point size to 0.5 and alpha to 0.8
    ggbeeswarm::geom_beeswarm(cex=0.5, alpha=0.8) +
    # add a transparent boxplot on top of points
    geom_boxplot(alpha=0)

md_speeding %>% 
    filter(vehicle_color == 'RED') %>%
    ggplot(aes(x = gender, y = speed)) + 
    # Replace beeswarm geometry with a violin geometry with kernel width of 2.5
    geom_violin(bw = 2.5) +
    # add individual points on top of violins
    geom_point(alpha=0.3, size=0.5)

md_speeding %>% 
    filter(vehicle_color == 'RED') %>%
    ggplot(aes(x = gender, y = speed)) + 
    geom_violin(bw = 2.5) +
    # add a transparent boxplot and shrink its width to 0.3
    geom_boxplot(alpha=0, width=0.3) +
    # Reset point size to default and set point shape to 95
    geom_point(alpha = 0.3, shape = 95) +
    # Supply a subtitle detailing the kernel width
    labs(subtitle = 'Gaussian kernel SD = 2.5')

md_speeding %>% 
    ggplot(aes(x = gender, y = speed)) + 
    # replace with violin plot with kernel width of 2.5, change color argument to fill 
    geom_violin(bw = 2.5, fill = 'steelblue') +
    # reduce width to 0.3
    geom_boxplot(alpha = 0, width=0.3) +
    facet_wrap(~vehicle_color) +
    labs(
        title = 'Speed of different car colors, separated by gender of driver',
        # add a subtitle w/ kernel width
        subtitle = "Gaussian kernel width: 2.5"
    )
## Warning in max(data$density): no non-missing arguments to max; returning -
## Inf

## Warning in max(data$density): no non-missing arguments to max; returning -
## Inf

md_speeding %>% 
    mutate(day_of_week = factor(day_of_week, levels=c(2, 3, 4, 5, 6, 7, 1), 
                                labels = c("Mon","Tues","Wed","Thu","Fri","Sat","Sun")
                                )
           ) %>% 
    ggplot(aes( x = percentage_over_limit, y = day_of_week)) + 
    # Set bandwidth to 3.5
    ggridges::geom_density_ridges(bandwidth=3.5) +
    # add limits of 0 to 150 to x-scale
    scale_x_continuous(limits=c(0, 150)) + 
    # provide subtitle with bandwidth
    labs(subtitle='Gaussian kernel SD = 3.5')
## Warning: Removed 9 rows containing non-finite values (stat_density_ridges).

md_speeding %>% 
    mutate(day_of_week = factor(day_of_week, levels=c(2, 3, 4, 5, 6, 7, 1), 
                                labels = c("Mon","Tues","Wed","Thu","Fri","Sat","Sun")
                                )
           ) %>% 
    ggplot(aes( x = percentage_over_limit, y = day_of_week)) + 
    # make ridgeline densities a bit see-through with alpha = 0.7
    ggridges::geom_density_ridges(bandwidth = 3.5, alpha=0.7) +
    # set expand values to c(0,0)
    scale_x_continuous(limits = c(0,150), expand=c(0, 0)) +
    labs(subtitle = 'Gaussian kernel SD = 3.5') +
    # remove y axis ticks
    theme(axis.ticks.y=element_blank())
## Warning: Removed 9 rows containing non-finite values (stat_density_ridges).

md_speeding %>% 
    mutate(day_of_week = factor(day_of_week, levels=c(2, 3, 4, 5, 6, 7, 1), 
                                labels = c("Mon","Tues","Wed","Thu","Fri","Sat","Sun")
                                )
           ) %>% 
    ggplot(aes( x = percentage_over_limit, y = day_of_week)) + 
    geom_point(
        # make semi-transparent with alpha = 0.2
        alpha=0.2, 
        # turn points to vertical lines with shape = '|'
        shape="|", 
        # nudge the points downward by 0.05
        position=position_nudge(y=-0.05)
    ) +
    ggridges::geom_density_ridges(bandwidth = 3.5, alpha = 0.7) +
    scale_x_continuous(limits = c(0,150), expand  = c(0,0)) +
    labs(subtitle = 'Gaussian kernel SD = 3.5') +
    theme( axis.ticks.y = element_blank() )
## Warning: Removed 9 rows containing non-finite values (stat_density_ridges).
## Warning: Removed 9 rows containing missing values (geom_point).


Linear Algebra for Data Science in R

Chapter 1 - Introduction to Linear Algebra

Motivations:

  • Vectors, matrices, tensors, and associated operations on and between them
  • Vectors are the most basic, non-trivial element for storing data
    • Generally shown with an arrow above the variable name
    • Can be transposed between row vectors and column vectors
  • Can create vectors in R in many ways
    • rep()
    • seq()
    • c()
    • z[n] <- a # replace element n of z with a
  • Matrices are tables of data with rows and columns; data frames can be considered a form of matrix
  • Can create matrices in R in many ways
    • matrix(data, nrow, ncol, byrow=FALSE) # set byrow=TRUE to fill the matrix by row
    • A[a, b] <- c # sets the value of row a and column b of matrix A to c

Matrix-Vector Operations:

  • May want to convert or run mathematical operations on vectors and matrices
    • An (axb) matrix %*% a (bxc) matrix will form an (axc) matrix
  • The asterisk between percents is matrix multiplication (%*%)
  • Vector-vector multiplication is the dot-product
  • Matrix-vector multiplication requires that the vector have the same number of elements as the matrix has columns
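
As a quick sketch of the dimension rule (my own hypothetical values, not course code): a 2x3 matrix times a length-3 vector yields a length-2 result.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)  # 2x3, filled by column: rows (1, 3, 5) and (2, 4, 6)
v <- c(1, 0, 1)                       # length 3 matches the 3 columns of A
A %*% v                               # 2x1 result: c(6, 8)
```

By contrast, `A %*% c(1, 0)` would throw a non-conformable arguments error, since the vector length must match the number of columns of A.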

Matrix-Matrix Calculations:

  • Matrix-matrix calculations can be useful for predictions (e.g., neural networks)
    • (mxn) %*% (nxr) = (mxr)
    • Order matters - A %*% B is not the same as B %*% A
    • The plain asterisk (*) gives component-wise multiplication; A * B is not the same as A %*% B
  • The identity matrix I is a diagonal matrix with 1 on the diagonal and 0 elsewhere
  • Additional important types of matrices
    • Square matrices are a special matrix where columns and rows are the same
    • Inverse matrices can be multiplied to create the identity matrix
    • Singular matrices do not have an inverse
    • Diagonal and triangular matrices, which are more computationally efficient
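
Small hypothetical 2x2 examples of these special matrices (my own numbers, not from the course):

```r
I2 <- diag(2)                       # 2x2 identity matrix
D  <- diag(c(2, 5))                 # diagonal matrix
S  <- matrix(c(1, 2, 2, 4), 2, 2)   # singular: second column is 2x the first
det(S)                              # 0, so solve(S) would throw an error
A  <- matrix(c(2, 0, 1, 3), 2, 2)   # invertible (non-zero determinant)
A %*% solve(A)                      # recovers the identity
```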

Example code includes:

# Creating three 3's and four 4's, respectively
rep(3, 3)
## [1] 3 3 3
rep(4, 4)
## [1] 4 4 4 4
# Creating a vector with the first three even numbers and the first three odd numbers
seq(2, 6, by = 2)
## [1] 2 4 6
seq(1, 5, by = 2)
## [1] 1 3 5
# Re-creating the previous four vectors using the 'c' command
c(3, 3, 3)
## [1] 3 3 3
c(4, 4, 4, 4)
## [1] 4 4 4 4
c(2, 4, 6)
## [1] 2 4 6
c(1, 3, 5)
## [1] 1 3 5
x <- 1:7
y <- 2*x
z <- c(1, 1, 2)

# Add x to y and print
print(x + y)
## [1]  3  6  9 12 15 18 21
# Multiply z by 2 and print
print(2 * z)
## [1] 2 2 4
# Multiply x and y by each other and print
print(x * y)
## [1]  2  8 18 32 50 72 98
# Add x to z, if possible, and print
print(x + z)  # should throw a warning for the recycling problem
## Warning in x + z: longer object length is not a multiple of shorter object
## length
## [1] 2 3 5 5 6 8 8
A <- matrix(1, 2, 2)

# Create a matrix of all 1's and all 2's that are 2 by 3 and 3 by 2, respectively
matrix(1, 2, 3)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
print(matrix(2, 3, 2))
##      [,1] [,2]
## [1,]    2    2
## [2,]    2    2
## [3,]    2    2
# Create a matrix, changing the byrow designation.
matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = FALSE)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    2
matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    2
# Add A to the previously-created matrix
A + matrix(c(1, 2, 3, 2), nrow = 2, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    2    3
## [2,]    4    3
A <- matrix(data=c(4, 0, 0, 1), nrow=2, ncol=2, byrow=FALSE)
b <- c(1, 1)
B <- matrix(data=c(1, 0, 0, 2/3), nrow=2, ncol=2, byrow=FALSE)


# Multiply A by b on the left
A %*% b
##      [,1]
## [1,]    4
## [2,]    1
# Multiply B by b on the left
B %*% b
##           [,1]
## [1,] 1.0000000
## [2,] 0.6666667
b <- c(2, 1)
A <- matrix(data=c(-1, 0, 0, 1), nrow=2, ncol=2, byrow=FALSE)
B <- matrix(data=c(1, 0, 0, -1), nrow=2, ncol=2, byrow=FALSE)
C1 <- matrix(data=c(-4, 0, 0, -2), nrow=2, ncol=2, byrow=FALSE)

# Multiply A by b on the left
A%*%b
##      [,1]
## [1,]   -2
## [2,]    1
# Multiply B by b on the left
B%*%b
##      [,1]
## [1,]    2
## [2,]   -1
# Multiply C1 by b on the left
C1%*%b
##      [,1]
## [1,]   -8
## [2,]   -2
A <- matrix(data=sqrt(2)*c(1, 1, -1, 1), nrow=2, ncol=2, byrow=FALSE)
B <- matrix(data=c(1, 0, 0, -1), nrow=2, ncol=2, byrow=FALSE)
b <- c(1, 1)

# Multiply A by B on the left
A%*%B
##          [,1]      [,2]
## [1,] 1.414214  1.414214
## [2,] 1.414214 -1.414214
# Multiply A by B on the right
B%*%A
##           [,1]      [,2]
## [1,]  1.414214 -1.414214
## [2,] -1.414214 -1.414214
# Multiply b by B then A (on the left)
A%*%B%*%b
##          [,1]
## [1,] 2.828427
## [2,] 0.000000
# Multiply b by A then B (on the left)
B%*%A%*%b
##           [,1]
## [1,]  0.000000
## [2,] -2.828427
A <- matrix(data=c(1, -1, 2, 2), nrow=2, ncol=2, byrow=FALSE)

# Take the inverse of the 2 by 2 identity matrix
solve(diag(2))
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
# Take the inverse of the matrix A
Ainv <- solve(A)

# Multiply A by its inverse on the left
Ainv%*%A
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
# Multiply A by its inverse on the right
A%*%Ainv
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Chapter 2 - Matrix-Vector Equations

Motivation for Solving Matrix-Vector Equations:

  • The question is often whether a build from atomic objects is possible and unique
    • Assuming that this question is linear, it can be solved using matrices and vectors
  • Vectors times constants is called a linear combination
    • The question is often whether vector b can be produced as a linear combination of the columns of A, with weights given by x
    • Ax = b
  • Can consider a matrix of performances in an n-team league as follows - Massey matrix
    • Diagonals are each number of games that team has played (n-1 in a round-robin)
    • Off-diagonals are all negative the number of games between the two teams (-1 in a round-robin)
    • Matrix is nxn
  • The goal is then to multiply the Massey matrix by an unknown ratings vector r to get the actual net point differential of the teams
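
A toy 3-team round-robin (hypothetical scores, not the WNBA data analyzed below) makes the structure concrete: each team plays 2 games, so the diagonal is 2 and every off-diagonal is -1.

```r
M <- matrix(c( 2, -1, -1,
              -1,  2, -1,
              -1, -1,  2), nrow = 3, byrow = TRUE)
f <- c(5, -2, -3)   # net point differentials; they cancel across the league
det(M)              # 0: each row sums to zero, so M is singular as-is
sum(f)              # 0
```

Because M is singular, a constraint (e.g., ratings sum to zero) has to be appended before solving, which is what the rbind/cbind steps in the example code below accomplish.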

Matrix-Vector Equations Theory:

  • Need to know the circumstances under which a matrix-vector equation can be uniquely solved
    • Inconsistent systems cannot be solved
    • Consistent systems can be solved, but there may be infinitely many solutions
  • When a matrix-vector has a solution but not infinite solutions, then it will have only one solution
  • If A is an nxn matrix, then the following conditions are equivalent and imply a solution to Ax = b
    • The matrix A must have an inverse - solve() will not produce an error
    • The determinant of A must be non-zero - det() will return a non-zero answer
    • The rows and columns of A form a basis for the set of all vectors with n elements
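
A quick check of these conditions on a hypothetical invertible matrix:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # columns (2, 1) and (1, 3)
det(A)          # 5, non-zero, so a unique solution exists for any b
Ainv <- solve(A)                      # succeeds; errors for a singular matrix
round(Ainv %*% A)                     # the 2x2 identity
```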

Solving Matrix-Vector Equations:

  • Solving equations in linear algebra is similar to solving equations in regular algebra
    • Ainv %*% A gives I and I %*% x gives x
    • x <- solve(A)%*%b
  • There is an additional matrix solution condition related to the zero vector
    • The homogeneous equation Ax = 0 must have only the trivial solution x = 0
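
A minimal worked example (hypothetical A and b, not from the course) of the algebra above:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(5, 10)
x <- solve(A) %*% b   # x = Ainv b, as in the notes; here x is c(1, 3)
A %*% x               # recovers b, confirming the solution
solve(A, b)           # same answer; numerically preferable to forming Ainv
```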

Other Considerations for Matrix-Vector Equations:

  • With more equations than unknowns, then one or more of the equations must be redundant for a solution to exist
  • With fewer equations than unknowns, it is difficult to have a unique solution (cannot rule out the extra solutions)
    • However, solutions can exist or fail to exist depending on the redundancy of the columns
  • Options for non-square matrices include
    • Row Reduction
    • Least Squares
    • Singular Value Decomposition
    • Generalized or Pseudo-Inverses
  • Can calculate the Moore-Penrose Generalized Inverse
    • MASS::ginv(A)
    • ginv(A)%*%A # gives the identity matrix
    • A%*%ginv(A) # does not give the identity matrix
    • x <- ginv(A)%*%b # gives one of the solutions to Ax = b
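
A small sketch of the generalized inverse on an underdetermined system (hypothetical numbers; MASS ships with standard R installations):

```r
A <- matrix(c(1, 0, 1, 1, 0, 1), nrow = 2)  # 2x3: fewer equations than unknowns
b <- c(2, 1)
x <- MASS::ginv(A) %*% b  # the minimum-norm solution among infinitely many
A %*% x                   # recovers b, so x really does solve Ax = b
```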

Example code includes:

M <- readr::read_csv("./RInputFiles/WNBA_Data_2017_M.csv")
## Parsed with column specification:
## cols(
##   Atlanta = col_double(),
##   Chicago = col_double(),
##   Connecticut = col_double(),
##   Dallas = col_double(),
##   Indiana = col_double(),
##   `Los Angeles` = col_double(),
##   Minnesota = col_double(),
##   `New York` = col_double(),
##   Phoenix = col_double(),
##   `San Antonio` = col_double(),
##   Seattle = col_double(),
##   Washington = col_double(),
##   WNBA = col_double()
## )
glimpse(M)
## Observations: 13
## Variables: 13
## $ Atlanta       <dbl> 33, -4, -2, -3, -3, -3, -3, -3, -3, -3, -3, -3, 1
## $ Chicago       <dbl> -4, 33, -3, -3, -3, -3, -2, -3, -3, -3, -3, -3, 1
## $ Connecticut   <dbl> -2, -3, 34, -3, -3, -3, -3, -4, -4, -3, -3, -3, 1
## $ Dallas        <dbl> -3, -3, -3, 34, -3, -4, -3, -3, -2, -3, -3, -4, 1
## $ Indiana       <dbl> -3, -3, -3, -3, 33, -3, -3, -3, -3, -3, -2, -4, 1
## $ `Los Angeles` <dbl> -3, -3, -3, -4, -3, 41, -8, -3, -6, -3, -2, -3, 1
## $ Minnesota     <dbl> -3, -2, -3, -3, -3, -8, 41, -3, -4, -3, -3, -6, 1
## $ `New York`    <dbl> -3, -3, -4, -3, -3, -3, -3, 34, -3, -2, -3, -4, 1
## $ Phoenix       <dbl> -3, -3, -4, -2, -3, -6, -4, -3, 38, -3, -4, -3, 1
## $ `San Antonio` <dbl> -3, -3, -3, -3, -3, -3, -3, -2, -3, 32, -4, -2, 1
## $ Seattle       <dbl> -3, -3, -3, -3, -2, -2, -3, -3, -4, -4, 33, -3, 1
## $ Washington    <dbl> -3, -3, -3, -4, -4, -3, -6, -4, -3, -2, -3, 38, 1
## $ WNBA          <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1
names(M) <- stringr::str_replace(names(M), " ", ".")
M <- M %>%
    select(-WNBA) %>%
    slice(-n())
    
M <- as.data.frame(M)
row.names(M) <- names(M)


f <- readr::read_csv("./RInputFiles/WNBA_Data_2017_f.csv")
## Parsed with column specification:
## cols(
##   Differential = col_double()
## )
glimpse(f)
## Observations: 13
## Variables: 1
## $ Differential <dbl> -135, -171, 152, -104, -308, 292, 420, 83, -4, -2...
f <- slice(f, -n())

f <- as.data.frame(f)
row.names(f) <- names(M)


# Print the Massey Matrix M Here
print(M)
##             Atlanta Chicago Connecticut Dallas Indiana Los.Angeles
## Atlanta          33      -4          -2     -3      -3          -3
## Chicago          -4      33          -3     -3      -3          -3
## Connecticut      -2      -3          34     -3      -3          -3
## Dallas           -3      -3          -3     34      -3          -4
## Indiana          -3      -3          -3     -3      33          -3
## Los.Angeles      -3      -3          -3     -4      -3          41
## Minnesota        -3      -2          -3     -3      -3          -8
## New.York         -3      -3          -4     -3      -3          -3
## Phoenix          -3      -3          -4     -2      -3          -6
## San.Antonio      -3      -3          -3     -3      -3          -3
## Seattle          -3      -3          -3     -3      -2          -2
## Washington       -3      -3          -3     -4      -4          -3
##             Minnesota New.York Phoenix San.Antonio Seattle Washington
## Atlanta            -3       -3      -3          -3      -3         -3
## Chicago            -2       -3      -3          -3      -3         -3
## Connecticut        -3       -4      -4          -3      -3         -3
## Dallas             -3       -3      -2          -3      -3         -4
## Indiana            -3       -3      -3          -3      -2         -4
## Los.Angeles        -8       -3      -6          -3      -2         -3
## Minnesota          41       -3      -4          -3      -3         -6
## New.York           -3       34      -3          -2      -3         -4
## Phoenix            -4       -3      38          -3      -4         -3
## San.Antonio        -3       -2      -3          32      -4         -2
## Seattle            -3       -3      -4          -4      33         -3
## Washington         -6       -4      -3          -2      -3         38
# Print the vector of point differentials f here
print(f)
##             Differential
## Atlanta             -135
## Chicago             -171
## Connecticut          152
## Dallas              -104
## Indiana             -308
## Los.Angeles          292
## Minnesota            420
## New.York              83
## Phoenix               -4
## San.Antonio         -213
## Seattle               -5
## Washington            -7
# Find the sum of the first column 
sum(M[, 1])
## [1] 0
# Find the sum of the vector f
sum(f)
## [1] 0
M <- as.matrix(M)

# Add a row of 1's
M <- rbind(M, rep(1, 12))

# Add a column of -1's 
M <- cbind(M, rep(-1, 13))

# Change the element in the lower-right corner of the matrix M
M[13, 13] <- 1

# Print M
print(M)
##             Atlanta Chicago Connecticut Dallas Indiana Los.Angeles
## Atlanta          33      -4          -2     -3      -3          -3
## Chicago          -4      33          -3     -3      -3          -3
## Connecticut      -2      -3          34     -3      -3          -3
## Dallas           -3      -3          -3     34      -3          -4
## Indiana          -3      -3          -3     -3      33          -3
## Los.Angeles      -3      -3          -3     -4      -3          41
## Minnesota        -3      -2          -3     -3      -3          -8
## New.York         -3      -3          -4     -3      -3          -3
## Phoenix          -3      -3          -4     -2      -3          -6
## San.Antonio      -3      -3          -3     -3      -3          -3
## Seattle          -3      -3          -3     -3      -2          -2
## Washington       -3      -3          -3     -4      -4          -3
##                   1       1           1      1       1           1
##             Minnesota New.York Phoenix San.Antonio Seattle Washington   
## Atlanta            -3       -3      -3          -3      -3         -3 -1
## Chicago            -2       -3      -3          -3      -3         -3 -1
## Connecticut        -3       -4      -4          -3      -3         -3 -1
## Dallas             -3       -3      -2          -3      -3         -4 -1
## Indiana            -3       -3      -3          -3      -2         -4 -1
## Los.Angeles        -8       -3      -6          -3      -2         -3 -1
## Minnesota          41       -3      -4          -3      -3         -6 -1
## New.York           -3       34      -3          -2      -3         -4 -1
## Phoenix            -4       -3      38          -3      -4         -3 -1
## San.Antonio        -3       -2      -3          32      -4         -2 -1
## Seattle            -3       -3      -4          -4      33         -3 -1
## Washington         -6       -4      -3          -2      -3         38 -1
##                     1        1       1           1       1          1  1
#Find the inverse of M
solve(M)
##                  Atlanta      Chicago  Connecticut       Dallas
## Atlanta      0.032449804  0.005402927  0.003876665  0.004630004
## Chicago      0.005402927  0.032446789  0.004608094  0.004626913
## Connecticut  0.003876665  0.004608094  0.031714805  0.004613451
## Dallas       0.004630004  0.004626913  0.004613451  0.031707219
## Indiana      0.004629590  0.004628272  0.004629714  0.004649172
## Los.Angeles  0.004626242  0.004554829  0.004676789  0.005214940
## Minnesota    0.004611109  0.003985203  0.004651940  0.004727810
## New.York     0.004609212  0.004627729  0.005362761  0.004647832
## Phoenix      0.004610546  0.004608018  0.005295038  0.004013187
## San.Antonio  0.004630254  0.004631081  0.004608596  0.004609009
## Seattle      0.004629212  0.004631185  0.004646217  0.004595132
## Washington   0.004627769  0.004582295  0.004649264  0.005298666
##             -0.083333333 -0.083333333 -0.083333333 -0.083333333
##                  Indiana  Los.Angeles    Minnesota     New.York
## Atlanta      0.004629590  0.004626242  0.004611109  0.004609212
## Chicago      0.004628272  0.004554829  0.003985203  0.004627729
## Connecticut  0.004629714  0.004676789  0.004651940  0.005362761
## Dallas       0.004649172  0.005214940  0.004727810  0.004647832
## Indiana      0.032447936  0.004652111  0.004678479  0.004649262
## Los.Angeles  0.004652111  0.027807608  0.007319076  0.004637275
## Minnesota    0.004678479  0.007319076  0.027810474  0.004677632
## New.York     0.004649262  0.004637275  0.004677632  0.031716432
## Phoenix      0.004613089  0.006363490  0.005388578  0.004648253
## San.Antonio  0.004587382  0.004606288  0.004578013  0.003835528
## Seattle      0.003854641  0.004032687  0.004573214  0.004607331
## Washington   0.005313685  0.004841998  0.006331805  0.005314087
##             -0.083333333 -0.083333333 -0.083333333 -0.083333333
##                  Phoenix  San.Antonio      Seattle   Washington
## Atlanta      0.004610546  0.004630254  0.004629212  0.004627769
## Chicago      0.004608018  0.004631081  0.004631185  0.004582295
## Connecticut  0.005295038  0.004608596  0.004646217  0.004649264
## Dallas       0.004013187  0.004609009  0.004595132  0.005298666
## Indiana      0.004613089  0.004587382  0.003854641  0.005313685
## Los.Angeles  0.006363490  0.004606288  0.004032687  0.004841998
## Minnesota    0.005388578  0.004578013  0.004573214  0.006331805
## New.York     0.004648253  0.003835528  0.004607331  0.005314087
## Phoenix      0.029212019  0.004646110  0.005265228  0.004669776
## San.Antonio  0.004646110  0.033267202  0.005427397  0.003906474
## Seattle      0.005265228  0.005427397  0.032485332  0.004585756
## Washington   0.004669776  0.003906474  0.004585756  0.029211757
##             -0.083333333 -0.083333333 -0.083333333 -0.083333333
##                         
## Atlanta     8.333333e-02
## Chicago     8.333333e-02
## Connecticut 8.333333e-02
## Dallas      8.333333e-02
## Indiana     8.333333e-02
## Los.Angeles 8.333333e-02
## Minnesota   8.333333e-02
## New.York    8.333333e-02
## Phoenix     8.333333e-02
## San.Antonio 8.333333e-02
## Seattle     8.333333e-02
## Washington  8.333333e-02
##             2.220446e-16
f <- as.matrix(f)
f <- rbind(f, 0)

# Solve for r and rename column
r <- solve(M)%*%f
colnames(r) <- "Rating"

# Print r
print(r)
##                    Rating
## Atlanta     -4.012938e+00
## Chicago     -5.156260e+00
## Connecticut  4.309525e+00
## Dallas      -2.608129e+00
## Indiana     -8.532958e+00
## Los.Angeles  7.850327e+00
## Minnesota    1.061241e+01
## New.York     2.541565e+00
## Phoenix      8.979110e-01
## San.Antonio -6.181574e+00
## Seattle     -2.666953e-01
## Washington   5.468121e-01
##              1.043610e-14
# Find the rating vector using ginv
r <- MASS::ginv(M)%*%f
colnames(r) <- "Rating"
print(r)
##              Rating
##  [1,] -4.012938e+00
##  [2,] -5.156260e+00
##  [3,]  4.309525e+00
##  [4,] -2.608129e+00
##  [5,] -8.532958e+00
##  [6,]  7.850327e+00
##  [7,]  1.061241e+01
##  [8,]  2.541565e+00
##  [9,]  8.979110e-01
## [10,] -6.181574e+00
## [11,] -2.666953e-01
## [12,]  5.468121e-01
## [13,]  5.773160e-14

Chapter 3 - Eigenvalues and Eigenvectors

Introduction to Eigenvalues and Eigenvectors:

  • Eigenvalues and eigenvectors take collections of multi-dimensional objects and select a subset of vectors that closely approximate the originals
  • Matrix-vector multiplication can have many impacts on a vector - reverse, reflect, dilate, contract, project, extract, permutations of these, etc.
  • Eigenvalue-eigenvector analysis allows for summing of the component vectors
    • Scalar c can be multiplied by vector x to create cx - note that cIx = cx
  • Goal is to decompose a matrix into a few matrices that can be treated similar to scalar multiplication

Definition of Eigenvalues and Eigenvectors:

  • For a matrix A, a scalar lambda is an eigenvalue of A with associated eigenvector v (v not equal to 0) if A %*% v = lambda * v
    • The matrix need not be a diagonal matrix
    • An eigenvalue and its eigenvector are often called an eigenpair
    • The eigenvectors stay fixed in direction as the matrix is applied; they need not be the simple x and y axes
    • If an eigenpair exists, the eigenvector can be multiplied by any non-zero scalar; eigenvectors are purely about direction, not about magnitude

Computing Eigenvalues and Eigenvectors in R:

  • An nxn matrix has, counting multiplicity, n eigenvalues
  • The eigenvalues need not be real; some or all may be complex, though complex eigenvalues come in conjugate pairs a + bi and a - bi
  • Can get the eigenvalues and eigenvectors using eigen()
    • eigen(A)
    • E <- eigen(A)
    • E$values[1] # the first eigenvalue
    • E$vectors[, 1] # the first eigenvector associated to the first eigenvalue
  • Example of matrix with complex eigenvalues
    • A <- matrix(data=c(1, -2, 2, -1), nrow=2, ncol=2, byrow=FALSE)
    • eigen(A) # both the eigenvalues and the eigenvectors will be complex

Some more on Eigenvalues and Eigenvectors:

  • If the eigenvalues lambda_1, ..., lambda_n of A are distinct, with associated eigenvectors v_1, ..., v_n, then this set of vectors forms a basis for the set of all n-dimensional vectors
    • Every n-dimensional vector can be expressed as a linear combination of these eigenvectors
    • Basically, eigenpairs turn matrix multiplication into a linear combination of scalar multiplications
  • If the matrix multiplication is iterated, then the lambdas are raised to the power of the number of iterations
  • The leading eigenvector, when normalized to sum to 1, is called the stationary distribution of the Markov matrix model
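
The iteration property above can be checked directly; this is a minimal sketch using an illustrative 2x2 matrix (not from the course material): applying A twice to an eigenvector scales it by lambda squared.

```r
# Illustrative 2x2 matrix (not from the course material)
A <- matrix(c(2, 0, 1, 3), nrow = 2)

E <- eigen(A)
lambda <- E$values[1]   # leading eigenvalue
v <- E$vectors[, 1]     # associated eigenvector

# One application scales v by lambda; two applications scale it by lambda^2
stopifnot(max(abs(A %*% v - lambda * v)) < 1e-10)
stopifnot(max(abs(A %*% (A %*% v) - lambda^2 * v)) < 1e-10)
```

The same pattern extends to any power k, which is why repeated Markov transitions converge along the leading eigenvector.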

Example code includes:

A <- matrix(data=c(1, 0, 0, 2/3), nrow=2, ncol=2, byrow=FALSE)

# A is loaded for you
print(A%*%rep(1, 2))
##           [,1]
## [1,] 1.0000000
## [2,] 0.6666667
A <- matrix(data=c(-1, 0, 0, 2, 7, 0, 4, 12, -4), nrow=3, ncol=3, byrow=FALSE)

# Show that 7 is an eigenvalue for A
A%*%c(0.2425356, 0.9701425, 0) - 7*c(0.2425356, 0.9701425, 0)
##       [,1]
## [1,] 2e-07
## [2,] 0e+00
## [3,] 0e+00
# Show that -4 is an eigenvalue for A
A%*%c(-0.3789810, -0.6821657, 0.6253186) - (-4)*c(-0.3789810, -0.6821657, 0.6253186)
##               [,1]
## [1,] -2.220446e-16
## [2,]  5.000000e-07
## [3,]  0.000000e+00
# Show that -1 is an eigenvalue for A
A%*%c(1, 0, 0) - (-1)*c(1, 0, 0)
##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0
# Show the double of the eigenvector
A%*%((2)*c(0.2425356, 0.9701425, 0)) - 7*(2)*c(0.2425356, 0.9701425, 0)
##       [,1]
## [1,] 4e-07
## [2,] 0e+00
## [3,] 0e+00
# Show half of the eigenvector
A%*%((0.5)*c(0.2425356, 0.9701425, 0)) - 7*(0.5)*c(0.2425356, 0.9701425, 0)
##       [,1]
## [1,] 1e-07
## [2,] 0e+00
## [3,] 0e+00
A <- matrix(data=c(1, 1, 2, 1), nrow=2, ncol=2, byrow=FALSE)

# Compute the eigenvalues of A and store in Lambda
Lambda <- eigen(A)

# Print eigenvalues
print(Lambda$values[1])
## [1] 2.414214
print(Lambda$values[2])
## [1] -0.4142136
# Verify that these numbers satisfy the conditions of being an eigenvalue
det(Lambda$values[1]*diag(2) - A)
## [1] -3.140185e-16
det(Lambda$values[2]*diag(2) - A)
## [1] -3.140185e-16
# Find the eigenvectors of A and store them in Lambda
Lambda <- eigen(A)

# Print eigenvectors
print(Lambda$vectors[, 1])
## [1] 0.8164966 0.5773503
print(Lambda$vectors[, 2])
## [1] -0.8164966  0.5773503
# Verify that these eigenvectors & their associated eigenvalues satisfy Av - lambda v = 0
Lambda$values[1]*Lambda$vectors[, 1] - A%*%Lambda$vectors[, 1]
##      [,1]
## [1,]    0
## [2,]    0
Lambda$values[2]*Lambda$vectors[, 2] - A%*%Lambda$vectors[, 2]
##               [,1]
## [1,] -1.110223e-16
## [2,]  8.326673e-17
Mtemp <- matrix(data=c(0.98, 0.005, 0.005, 0.01, 0.005, 0.98, 0.01, 0.005, 0.005, 0.01, 0.98, 0.005, 0.01, 0.005, 0.005, 0.98), nrow=4, ncol=4, byrow=FALSE)
Mtemp
##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.980 0.005 0.005 0.010
## [2,] 0.005 0.980 0.010 0.005
## [3,] 0.005 0.010 0.980 0.005
## [4,] 0.010 0.005 0.005 0.980
# This code iterates the mutation 1000 times
x <- c(1, 0, 0, 0)
for (j in 1:1000) {x <- Mtemp%*%x}

# Print x
print(x)
##      [,1]
## [1,] 0.25
## [2,] 0.25
## [3,] 0.25
## [4,] 0.25
# Print and scale the first eigenvector of M
Lambda <- eigen(M)
v1 <- Lambda$vectors[, 1]/sum(Lambda$vectors[, 1])

print(v1)
##  [1] -4.230857e+12+0i  4.505406e+13+0i -1.387265e+13+0i  7.797797e+13+0i
##  [5]  1.204798e+13+0i -6.278774e+14+0i  6.232749e+14+0i  1.751380e+13+0i
##  [9]  1.575987e+14+0i -1.006767e+13+0i -5.480824e+13+0i -2.226106e+14+0i
## [13]  4.683250e-03+0i

Chapter 4 - Principal Component Analysis

Introduction to the Idea of PCA (Principal Component Analysis):

  • PCA is a common dimension-reduction technique in machine learning and data science
  • Real-world data often has many rows (observations) and also many columns (features)
    • More rows is almost always better, but more columns can be problematic (especially if they are correlated)
  • PCA is a useful applied method for linear algebra
    • Non-parametric manner of extracting information from confusing data sets
    • Uncovers hidden, low-dimensional structures that underlie data
    • Often easier to visualize and interpret

Linear Algebra Behind PCA:

  • The matrix t(A) is made by interchanging the rows and columns of A
  • Suppose that the matrix A has n rows and also has had each column mean-adjusted to 0
    • Then, t(A)%*%A / (n-1) is the covariance matrix, where the value in cell (i, j) is the covariance between columns i and j of matrix A
    • As such, cell (i, i) - the diagonal - contains the variance of the i-th column
    • Note that t(A)%*%A is always a square matrix
  • Suppose that A <- matrix(data=c(1:5, 2*(1:5)), nrow=5, ncol=2, byrow=FALSE)
    • A[, 1] <- A[, 1] - mean(A[, 1])
    • A[, 2] <- A[, 2] - mean(A[, 2])
    • t(A)%*%A/(nrow(A) - 1)
  • The total variance of the dataset is the sum of the eigenvalues of t(A) %*% A / (n-1)
    • The associated eigenvectors are called the principal components of the data
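
The covariance identity above can be verified directly; this sketch reuses the small example matrix from the bullet points.

```r
# Small example matrix from the bullets above
A <- matrix(data = c(1:5, 2 * (1:5)), nrow = 5, ncol = 2, byrow = FALSE)

# Mean-adjust each column to 0
A <- scale(A, center = TRUE, scale = FALSE)

# t(A) %*% A / (n - 1) reproduces the covariance matrix
C <- t(A) %*% A / (nrow(A) - 1)
stopifnot(max(abs(C - cov(A))) < 1e-10)

# Diagonal entries are the column variances
stopifnot(abs(C[1, 1] - var(A[, 1])) < 1e-10)
```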

Performing PCA in R:

  • Can run the PCA using prcomp()
    • prcomp(A) # prints the standard deviations, which are just the square roots of the variances from the previous example
    • summary(prcomp(A))
  • Can extract the principal components and then apply them to the data
    • head(prcomp(A)$x[, 1:2])
    • head(cbind(combine[, 1:4], prcomp(A)$x[, 1:2]))
  • Can run many checks after PCA
    • Data wrangling and quality control - perhaps there is a main grouping variable such as position
    • Data visualization may be much easier in 2-dimensions, though PCA is generally better for clustering
    • Can use PCA as an input for supervised learning, since the principal components are uncorrelated by construction (no linear redundancy remains)
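
A quick sketch tying prcomp() back to the eigen-decomposition (the data here is randomly generated for illustration, not the combine dataset): the squared standard deviations reported by prcomp() are exactly the eigenvalues of the covariance matrix.

```r
set.seed(1)
# Illustrative data (not the combine dataset)
X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)
X[, 3] <- X[, 1] + 0.1 * X[, 3]   # make two columns correlated

pca <- prcomp(X)

# prcomp's variances equal the eigenvalues of the covariance matrix
ev <- eigen(cov(X))$values
stopifnot(max(abs(pca$sdev^2 - ev)) < 1e-10)
```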

Wrap Up:

  • Vectors and matrices and their interactions
  • Matrix-vector equations and their solutions
  • Eigenvalues and eigenvectors
  • PCA and applications to multivariate datasets

Example code includes:

combine <- readr::read_csv("./RInputFiles/DataCampCombine.csv")
## Parsed with column specification:
## cols(
##   player = col_character(),
##   position = col_character(),
##   school = col_character(),
##   year = col_double(),
##   height = col_double(),
##   weight = col_double(),
##   forty = col_double(),
##   vertical = col_double(),
##   bench = col_double(),
##   broad_jump = col_double(),
##   three_cone = col_double(),
##   shuttle = col_double(),
##   drafted = col_character()
## )
glimpse(combine)
## Observations: 2,885
## Variables: 13
## $ player     <chr> "Jaire Alexander", "Brian Allen", "Mark Andrews", "...
## $ position   <chr> "CB", "C", "TE", "S", "EDGE", "DE", "WR", "CB", "IL...
## $ school     <chr> "Louisville", "Michigan St.", "Oklahoma", "Penn St....
## $ year       <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201...
## $ height     <dbl> 71, 73, 77, 74, 76, 78, 76, 72, 73, 73, 75, 76, 80,...
## $ weight     <dbl> 192, 298, 256, 198, 257, 262, 216, 185, 248, 228, 2...
## $ forty      <dbl> 4.38, 5.34, 4.67, 4.34, 4.87, 4.60, 4.62, 4.36, 4.5...
## $ vertical   <dbl> 35.0, 26.5, 31.0, 41.0, 30.0, 38.5, 34.0, 31.5, 36....
## $ bench      <dbl> 14, 27, 17, 16, 20, 18, 13, 13, 26, 15, 16, 31, 14,...
## $ broad_jump <dbl> 127, 99, 113, 131, 118, 128, 121, 119, 124, 122, 11...
## $ three_cone <dbl> 6.71, 7.81, 7.34, 6.56, 7.12, 7.53, 7.07, 6.93, 6.9...
## $ shuttle    <dbl> 3.98, 4.71, 4.38, 4.03, 4.23, 4.48, 4.25, 4.40, 4.3...
## $ drafted    <chr> "Green Bay Packers / 1st / 18th pick / 2018", "Los ...
# Print the first few observations of the dataset
head(combine)
## # A tibble: 6 x 13
##   player position school  year height weight forty vertical bench
##   <chr>  <chr>    <chr>  <dbl>  <dbl>  <dbl> <dbl>    <dbl> <dbl>
## 1 Jaire~ CB       Louis~  2018     71    192  4.38     35      14
## 2 Brian~ C        Michi~  2018     73    298  5.34     26.5    27
## 3 Mark ~ TE       Oklah~  2018     77    256  4.67     31      17
## 4 Troy ~ S        Penn ~  2018     74    198  4.34     41      16
## 5 Doran~ EDGE     Kansas  2018     76    257  4.87     30      20
## 6 Ade A~ DE       Tulane  2018     78    262  4.6      38.5    18
## # ... with 4 more variables: broad_jump <dbl>, three_cone <dbl>,
## #   shuttle <dbl>, drafted <chr>
# Find the correlation between variables forty and three_cone
cor(combine$forty, combine$three_cone)
## [1] 0.8315171
# Find the correlation between variables vertical and broad_jump
cor(combine$vertical, combine$broad_jump)
## [1] 0.8163375
# Extract columns 5-12 of combine
A <- combine[, 5:12]

# Take the matrix of A
A <- as.matrix(A)

# Subtract the mean of all columns
A[, 1] <- A[, 1] - mean(A[, 1])
A[, 2] <- A[, 2] - mean(A[, 2])
A[, 3] <- A[, 3] - mean(A[, 3])
A[, 4] <- A[, 4] - mean(A[, 4])
A[, 5] <- A[, 5] - mean(A[, 5])
A[, 6] <- A[, 6] - mean(A[, 6])
A[, 7] <- A[, 7] - mean(A[, 7])
A[, 8] <- A[, 8] - mean(A[, 8])


# Create matrix B from equation in instructions
B <- t(A)%*%A/(nrow(A) - 1)

# Compare 1st element of B to 1st column of variance of A
B[1,1]
## [1] 7.159794
var(A[, 1])
## [1] 7.159794
# Compare 1st element of 2nd column and row element of B to 1st and 2nd columns of A 
B[1, 2]
## [1] 90.78808
B[2, 1]
## [1] 90.78808
cov(A[, 1], A[, 2])
## [1] 90.78808
# Find eigenvalues of B
V <- eigen(B)

# Print eigenvalues
V$values
## [1] 2.187628e+03 4.403246e+01 2.219205e+01 5.267129e+00 2.699702e+00
## [6] 6.317016e-02 1.480866e-02 1.307283e-02
# Scale columns 5-12 of combine
B <- scale(combine[, 5:12])

# Print the first few rows of the data
head(B)
##           height      weight      forty   vertical      bench  broad_jump
## [1,] -1.11844839 -1.30960025 -1.3435337  0.5624657 -1.1089286  1.45502476
## [2,] -0.37100257  1.00066356  1.6449741 -1.4281627  0.9238361 -1.49512459
## [3,]  1.12388907  0.08527601 -0.4407553 -0.3743006 -0.6398290 -0.02004991
## [4,]  0.00272034 -1.17883060 -1.4680548  1.9676151 -0.7961955  1.87647467
## [5,]  0.75016616  0.10707096  0.1818505 -0.6084922 -0.1707295  0.50676247
## [6,]  1.49761199  0.21604566 -0.6586673  1.3821362 -0.4834625  1.56038724
##       three_cone    shuttle
## [1,] -1.38083506 -1.5879750
## [2,]  1.16888714  1.1170258
## [3,]  0.07946038 -0.1057828
## [4,] -1.72852445 -1.4027010
## [5,] -0.43048406 -0.6616049
## [6,]  0.51986694  0.2647653
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.3679 0.9228 0.78904 0.61348 0.46811 0.37178
## Proportion of Variance 0.7009 0.1064 0.07782 0.04704 0.02739 0.01728
## Cumulative Proportion  0.7009 0.8073 0.88514 0.93218 0.95957 0.97685
##                            PC7     PC8
## Standard deviation     0.34834 0.25266
## Proportion of Variance 0.01517 0.00798
## Cumulative Proportion  0.99202 1.00000
# Subset combine only to "WR"
combine_WR <- subset(combine, position == "WR")

# Scale columns 5-12 of combine
B <- scale(combine_WR[, 5:12])

# Print the first few rows of the data
head(B)
##          height      weight       forty   vertical      bench  broad_jump
## [1,]  1.4022982  0.88324903  1.20674474 -0.3430843 -0.3223377  0.07414249
## [2,]  0.5575402 -0.09700717 -0.80129388 -0.4969965 -0.7938424 -0.95388361
## [3,]  0.9799192  1.58343202  0.88968601  1.0421255  0.8564239  1.61618163
## [4,]  0.9799192  1.16332222  1.41811723 -1.5743819 -0.7938424 -1.29655897
## [5,] -1.1319757 -1.56739147 -0.80129388 -0.1891721 -0.0865854 -1.29655897
## [6,]  0.1351613  0.11304773  0.04419607  0.2725645 -1.0295947  0.24548017
##        three_cone     shuttle
## [1,]  0.712845019  0.02833449
## [2,] -1.098542478  0.84141123
## [3,] -1.853287268 -1.46230619
## [4,] -1.148858797  0.50262926
## [5,]  0.008416548 -0.64922946
## [6,]  0.109049187  0.84141123
# Summarize the principal component analysis
summary(prcomp(B))
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.5425 1.4255 1.0509 0.9603 0.77542 0.63867 0.59792
## Proportion of Variance 0.2974 0.2540 0.1380 0.1153 0.07516 0.05099 0.04469
## Cumulative Proportion  0.2974 0.5514 0.6894 0.8047 0.87987 0.93085 0.97554
##                            PC8
## Standard deviation     0.44235
## Proportion of Variance 0.02446
## Cumulative Proportion  1.00000

HR Analytics in R: Predicting Employee Churn

Chapter 1 - Introduction

Turnover:

  • Employee turnover is about an employee exiting the company (churn, turnover, attrition are roughly equivalent)
    • Nearly 25% of US workers leave their jobs every year
    • Replacing an employee can cost ~50% of the employee’s first-year salary
  • Turnover may be voluntary (resignation) or involuntary (firing, end of contract, etc.)
    • Stated and actual drivers of turnover may not always be the same

Exploring the data:

  • Can begin with data exploration of the organizational data
    • emp_id is the employee ID
    • status is “active” or “inactive”
    • turnover is 1 if they left, 0 otherwise
    • cutoff_date is the study end date
  • Turnover rate is the percentage of employees who left the organization in a given period of time
    • org %>% count(status)
    • org %>% summarize(turnover_rate = mean(turnover))
    • df_level <- org %>% group_by(level) %>% summarize(turnover_level = mean(turnover))
    • ggplot(df_level, aes(x = level, y = turnover_level)) + geom_col()

HR data architecture:

  • May want to identify segments of the data for more meaningful insights and interventions
    • org2 <- org %>% filter(level %in% c("Analyst", "Specialist"))
  • There are many data sources that may need to be integrated for employee analysis
    • df3 <- left_join(df1, df2, by = "emp_id")

Example code includes:

# Load the readr and dplyr packages
library(readr)
library(dplyr)

# Import the org data
org <- read_csv("org.csv")

# Check the structure of org dataset, the dplyr way
glimpse(org)


# Count Active and Inactive employees
org %>% 
  count(status)

# Calculate turnover rate
org %>% 
  summarize(avg_turnover_rate = mean(turnover))


# Calculate level wise turnover rate
df_level <- org %>% 
  group_by(level) %>% 
  summarize(turnover_level = mean(turnover))

# Check the results
df_level

# Visualize the results
ggplot(df_level, aes(x = level, y = turnover_level)) + 
  geom_col()


# Calculate location wise turnover rate
df_location <- org %>% 
  group_by(location) %>% 
  summarize(turnover_location = mean(turnover))

# Check the results
df_location

# Visualize the results
ggplot(df_location, aes(x = location, y = turnover_location)) +
  geom_col()


# Count the number of employees across levels
org %>% 
  count(level)

# Select the employees at Analyst and Specialist level
org2 <- org %>%
  filter(level %in% c("Analyst", "Specialist")) 

# Validate the results
org2 %>% 
  count(level)


# View the structure of rating dataset
glimpse(rating)

# Complete the code to join rating to org2 dataset
org3 <- left_join(org2, rating, by = "emp_id")

# Calculate rating wise turnover rate
df_rating <- org3 %>% 
  group_by(rating) %>% 
  summarize(turnover_rating = mean(turnover))

# Check the result
df_rating


# View the structure of survey dataset
glimpse(survey)

# Complete the code to join survey to org3 dataset
org_final <- left_join(org3, survey, by="mgr_id")

# Compare manager effectiveness scores
ggplot(org_final, aes(x = status, y = mgr_effectiveness)) +
  geom_boxplot()


# View the structure of the dataset
glimpse(org_final)

# Number of variables in the dataset
variables <- ncol(org_final)

# Compare the travel distance of Active and Inactive employees
ggplot(org_final, aes(x = status, y = distance_from_home)) +
  geom_boxplot()

Chapter 2 - Feature Engineering

Feature engineering:

  • Can create additional variables from the variables already available in the dataset
  • Job-hopping describes frequently switching jobs; job-hoppers are often at higher risk of turnover
  • Can compute tenure and timespan
    • org_final %>% mutate(date_of_joining = dmy(date_of_joining), cutoff_date = dmy(cutoff_date), last_working_date = dmy(last_working_date))
    • date_1 <- ymd("2000-01-01")
    • date_2 <- ymd("2014-08-09")
    • time_length(interval(date_1, date_2), "years")
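
The tenure calculation above can be sketched end-to-end with lubridate, using the two dates from the bullets:

```r
library(lubridate)

date_1 <- ymd("2000-01-01")
date_2 <- ymd("2014-08-09")

# Length of the interval in (fractional) years
yrs <- time_length(interval(date_1, date_2), "years")
yrs   # roughly 14.6 years
```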

Compensation:

  • Compensation is a key factor driving turnover, and there are many drivers
    • ggplot(emp_tenure, aes(x = compensation)) + geom_histogram()
    • ggplot(emp_tenure, aes(x = level, y = compensation)) + geom_boxplot()
  • The compa-ratio is the actual compensation divided by the median compensation
    • emp_compa_ratio <- emp_tenure %>% group_by(level) %>% mutate(median_compensation = median(compensation), compa_ratio = (compensation / median_compensation))
    • emp_compa_ratio %>% distinct(level, median_compensation)

Information value:

  • Can examine the "information value" of each independent variable with respect to the dependent variable
    • Measure of the predictive power of the independent variable on the response
    • IV <- Information::create_infotables(data = emp_final, y = "turnover") # y is the target (dependent) variable
    • IV$Summary
  • Generally, information value >0.4 is strong, <0.15 is weak, and 0.15 < x < 0.4 is moderate

Example code includes:

# Add age_diff
emp_age_diff <- org_final %>%
  mutate(age_diff = mgr_age - emp_age)

# Plot the distribution of age difference
ggplot(emp_age_diff, aes(x = status, y = age_diff)) + 
  geom_boxplot()


# Add job_hop_index
emp_jhi <- emp_age_diff %>% 
  mutate(job_hop_index = ifelse(no_previous_companies_worked != 0,  total_experience / no_previous_companies_worked, 0))

# Compare job hopping index of Active and Inactive employees
ggplot(emp_jhi, aes(x = status, y = job_hop_index)) + 
  geom_boxplot()


# Add tenure
emp_tenure <- emp_jhi %>%
  mutate(tenure = ifelse(status == "Active", 
                         time_length(interval(date_of_joining, cutoff_date), 
                                     "years"), 
                         time_length(interval(date_of_joining, last_working_date), 
                                     "years")))

# Compare tenure of active and inactive employees
ggplot(emp_tenure, aes(x = status, y = tenure)) + 
  geom_boxplot()


# Plot the distribution of compensation
ggplot(emp_tenure, aes(x = compensation)) + 
  geom_histogram()

# Plot the distribution of compensation across levels
ggplot(emp_tenure, 
       aes(x = level, y = compensation)) +
  geom_boxplot()

# Compare compensation of Active and Inactive employees across levels
ggplot(emp_tenure, 
       aes(x = level, y = compensation, fill = status)) + 
  geom_boxplot()


# Add median_compensation and compa_ratio
emp_compa_ratio <- emp_tenure %>%  
  group_by(level) %>%   
  mutate(median_compensation = median(compensation), 
         compa_ratio = compensation / median_compensation)

# Look at the median compensation for each level           
emp_compa_ratio %>% 
  distinct(level, median_compensation)


# Add compa_level
emp_final <- emp_compa_ratio %>%  
  mutate(compa_level = ifelse(compa_ratio > 1, "Above", "Below"))

# Compare compa_level for Active & Inactive employees
ggplot(emp_final, aes(x = status, fill = compa_level)) + 
  geom_bar(position = "fill")


# Load Information package
library(Information)

# Compute Information Value 
IV <- create_infotables(data = emp_final, y = "turnover")

# Print Information Value 
IV$Summary

Chapter 3 - Predicting Turnover

Data Splitting:

  • Can split the data into test and train, then build the model on the train data and check out-of-sample performance on the (previously unseen) test data
    • index_train <- caret::createDataPartition(emp_final$turnover, p = 0.5, list = FALSE)
    • train_set <- emp_final[index_train, ]
    • test_set <- emp_final[-index_train, ]

Introduction to Logistic Regression:

  • Can use logistic regression to predict the probability of employee turnover
    • Categorize data based on the independent variable
    • Target class for this case is turnover
    • simple_log <- glm(turnover ~ emp_age, family = "binomial", data = train_set)
    • summary(simple_log)
  • May want to remove some of the independent variables that are non-relevant
    • train_set_multi <- train_set %>% select(-c(emp_id, mgr_id, date_of_joining, last_working_date, cutoff_date, mgr_age, emp_age, median_compensation, department, status))
    • multi_log <- glm(turnover ~ ., family = "binomial", data = train_set_multi)

Multicollinearity:

  • Strongly related variables do not provide new, linear information
  • Correlation is the measure of linear association between two variables, between -1 and +1
    • cor(train_set$emp_age, train_set$compensation)
  • Multicollinearity is where a variable is correlated with two or more of the other variables
  • The VIF (variance inflation factor) can be assessed using car::vif() - considered to be highly correlated for VIF >= 5
    • multi_log <- glm(turnover ~ ., family = "binomial", data = train_set_multi)
    • car::vif(multi_log)
  • Generally, remove the variable with the highest VIF (assuming it is greater than 5), re-fit the model, then repeat
    • new_model <- glm(dependent_variable ~ . - variable_to_remove, family = "binomial", data = dataset)

Building final model:

  • Can then build the final model based on the relevant, independent variables
  • Can use the model to calculate probabilities
    • prediction_train <- predict(final_log, newdata = train_set_final, type = "response") # type = "response" gives a vector of probabilities
  • Generally a best practice to explore the range of the probabilities
    • hist(prediction_train)
    • prediction_test <- predict(final_log, newdata = test_set, type = "response")

Example code includes:

# Load caret
library(caret)

# Set seed of 567
set.seed(567)

# Store row numbers for training dataset: index_train
index_train <- createDataPartition(emp_final$turnover, p = 0.7, list = FALSE)

# Create training dataset: train_set
train_set <- emp_final[index_train, ]

# Create testing dataset: test_set
test_set <- emp_final[-index_train, ]


# Calculate turnover proportion in train_set
train_set %>% 
  count(status) %>% 
  mutate(prop = n / sum(n))

# Calculate turnover proportion in test_set
test_set %>% 
  count(status) %>% 
  mutate(prop = n / sum(n))


# Build a simple logistic regression model
simple_log <- glm(turnover ~ percent_hike, 
                  family = "binomial", data = train_set_multi)

# Print summary
summary(simple_log)


# Build a multiple logistic regression model
multi_log <- glm(turnover ~ ., family = "binomial", 
                 data = train_set_multi)

# Print summary
summary(multi_log)


# Load the car package
library(car)

# Model you built in a previous exercise
multi_log <- glm(turnover ~ ., family = "binomial", data = train_set_multi)

# Check for multicollinearity
vif(multi_log)

# Which variable has the highest VIF?
highest <- "level"


# Remove level
model_1 <- glm(turnover ~ . - level, family = "binomial", 
               data = train_set_multi)

# Check multicollinearity again
vif(model_1)

# Which variable has the highest VIF value?
highest <- "compensation"

# Remove level & compensation
model_2 <- glm(turnover ~ . - level - compensation, family = "binomial", 
               data = train_set_multi)

# Check multicollinearity again
vif(model_2)

# Does any variable have a VIF greater than 5?
highest <- FALSE


# Build the final logistic regression model
final_log <- glm(turnover ~ ., family = "binomial", data=train_set_final)

# Print summary 
summary(final_log)


# Make predictions for training dataset
prediction_train <- predict(final_log, newdata = train_set, type = "response")

# Look at the prediction range
hist(prediction_train)

# Make predictions for testing dataset
prediction_test <- predict(final_log, newdata = test_set, type = "response")

# Look at the prediction range
hist(prediction_test)

# Print the probability of turnover
prediction_test[c(150, 200)]

Chapter 4 - Model Validation, HR Interventions, and ROI

Validating logistic regression results:

  • Need to use a cutoff to convert probabilities to binary decisions
    • pred_cutoff_50_test <- ifelse(predictions_test > 0.5, 1, 0)
  • The confusion matrix assesses the number of correct predictions made by the model
    • TP - true positives (predicted positive, actually positive)
    • TN - true negatives (predicted negative, actually negative)
    • FP - false positives (predicted positive, actually negative)
    • FN - false negatives (predicted negative, actually positive)
  • Can look at metrics for the confusion matrix
    • Accuracy = (TP + TN) / (TP + TN + FP + FN)
    • conf_matrix_50 <- confusionMatrix(table(test_set$turnover, pred_cutoff_50_test))
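
A minimal base-R sketch of the cutoff-and-tabulate workflow above, with made-up probabilities standing in for real model output:

```r
# Toy predicted probabilities and true labels (illustrative values only)
probs  <- c(0.10, 0.80, 0.35, 0.65, 0.90, 0.20)
actual <- c(0, 1, 1, 0, 1, 0)

# Convert probabilities to binary decisions with a 0.5 cutoff
pred <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix: rows = predicted class, columns = actual class
cm <- table(predicted = pred, actual = actual)

# Accuracy = (TP + TN) / (TP + TN + FP + FN); the diagonal holds TP and TN
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 4 of 6 correct, ~0.667
```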

Designing retention strategy:

  • May want to design retention strategies based on the predicted likelihoods of leaving
    • Calculate probability of turnover only for the active employees
    • emp_risk <- emp_final %>% filter(status == "Active") %>% tidypredict::tidypredict_to_column(final_log)
    • emp_risk %>% select(emp_id, fit) %>% top_n(5, wt = fit)
  • May want to then bucket the risk of leaving for each employee
    • emp_risk_bucket <- emp_risk %>% mutate(risk_bucket = cut(fit, breaks = c(0, 0.5, 0.6, 0.8, 1), labels = c("no-risk", "low-risk", "medium-risk", "high-risk")))
  • Can then prioritize retention strategies for employees in the higher turnover risk buckets
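
The bucketing step can be sketched with base R's cut(), using the same breaks as the notes above and toy fitted probabilities:

```r
# Toy predicted turnover probabilities (illustrative values only)
fit <- c(0.05, 0.55, 0.72, 0.91, 0.40, 0.63)

# Assign each probability to a labeled risk band; the intervals are
# (0, 0.5], (0.5, 0.6], (0.6, 0.8], (0.8, 1]
risk_bucket <- cut(fit,
                   breaks = c(0, 0.5, 0.6, 0.8, 1),
                   labels = c("no-risk", "low-risk", "medium-risk", "high-risk"))

# Count employees in each bucket
table(risk_bucket)
```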

Return on investment:

  • Can calculate ROI based on the retention strategies employed (costs and outcomes)

Wrap up:

  • Basics of turnover analysis
  • Integrating data
  • Modeling data
  • Making conclusions

Example code includes:

# Classify predictions using a cut-off of 0.5
prediction_categories <- ifelse(prediction_test > 0.5, 1, 0)

# Construct a confusion matrix
conf_matrix <- table(prediction_categories, test_set$turnover)
conf_matrix


# Load caret
library(caret)

# Call confusionMatrix
confusionMatrix(conf_matrix)

# What is the accuracy?
accuracy <- round(unname(confusionMatrix(conf_matrix)$overall["Accuracy"]), 3)


# Load tidypredict 
library(tidypredict)

# Calculate probability of turnover
emp_risk <- emp_final %>%
  filter(status == "Active") %>%
  tidypredict_to_column(final_log)

# Run the code
emp_risk %>% 
  select(emp_id, fit) %>% 
  top_n(2)


# Create turnover risk buckets
emp_risk_bucket <- emp_risk %>% 
  mutate(risk_bucket = cut(fit, breaks = c(0, 0.5, 0.6, 0.8, 1), 
                           labels = c("no-risk", "low-risk", 
                                      "medium-risk", "high-risk")))

# Count employees in each risk bucket
emp_risk_bucket %>% 
  count(risk_bucket)


# Plot histogram of percent hike
ggplot(emp_final, aes(x = percent_hike)) +
  geom_histogram(binwidth = 3)

# Create salary hike_range of Analyst level employees
emp_hike_range <- emp_final %>% 
  filter(level == "Analyst") %>% 
  mutate(hike_range = cut(percent_hike, breaks = c(0, 10, 15, 20),
                          include.lowest = TRUE, 
                          labels = c("0 to 10", "11 to 15", "16 to 20")
                          )
        )


# Calculate the turnover rate for each salary hike range
df_hike <- emp_hike_range %>% 
  group_by(hike_range) %>% 
  summarize(turnover_rate_hike = mean(turnover))

# Check the results
df_hike

# Visualize the results
ggplot(df_hike, aes(x = hike_range, y = turnover_rate_hike)) + 
  geom_col()


# Compute extra cost
extra_cost <- median_salary_analyst * (0.05)

# Compute savings
savings <-  turnover_cost * 0.15

# Calculate ROI
ROI <- (savings / extra_cost) * 100

# Print ROI
cat(paste0("The return on investment is ", round(ROI), "%!"))

Dealing with Missing Data in R

Chapter 1 - Why Care About Missing Data?

Introduction to Missing Data:

  • Need to be able to work with missing data; common in real-world applications
  • Imputation needs to be done carefully; just plugging in the mean or median may lead to quirky results
  • Missing values are values that should have been recorded but were not
    • naniar::any_na(x)
    • naniar::are_na(x)
    • naniar::n_miss(x)
    • naniar::prop_miss(x)
  • Working with missing data
    • Arithmetic with NA returns NA: NA + 1 # NA
  • Can also have NaN, which is interpreted by R as a missing number
    • naniar::any_na(NaN) # TRUE
  • NULL is an empty value that is not missing
    • naniar::any_na(NULL) # FALSE
  • Be careful about boolean operations; NA follows three-valued logic
    • NA | TRUE # TRUE (TRUE no matter what the missing value is)
    • NA | FALSE # NA (the result depends on the missing value)
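
These NA rules can be checked directly in base R, no naniar required:

```r
# Arithmetic with NA propagates the missingness
NA + 1                           # NA
mean(c(1, 3, NA))                # NA
mean(c(1, 3, NA), na.rm = TRUE)  # 2

# NaN is treated as missing; NULL is empty, not missing
is.na(NaN)    # TRUE
length(NULL)  # 0 -- nothing there to be missing

# Three-valued logic: the result is NA only when the missing value matters
NA | TRUE     # TRUE  (TRUE no matter what NA stands for)
NA & FALSE    # FALSE (FALSE no matter what NA stands for)
NA | FALSE    # NA    (depends on what NA stands for)
```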

Why care about missing values?

  • Basic summaries of missingness return a single number - n_miss() or n_complete() for example
  • The naniar library has a series of functions at various granularity that all start with miss_*()
    • miss_var_summary(airquality)
    • miss_case_summary(airquality)
    • miss_var_table(airquality)
    • miss_case_table(airquality)
  • Can look at data over a specific span or run
    • miss_var_span(pedestrian, var = hourly_counts, span_every = 4000) # each span of 4000 is treated as a group, with statistics reported
    • miss_var_run(pedestrian, hourly_counts) # streaks of missingness; length of each run (repeating patterns)
    • airquality %>% group_by(Month) %>% miss_var_summary()
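
The same per-variable, per-case, and grouped summaries can be approximated in base R on the built-in airquality data; this is essentially what the naniar functions tabulate:

```r
data(airquality)  # built-in; NAs occur in Ozone and Solar.R

# Per-variable missing counts (analogue of miss_var_summary())
colSums(is.na(airquality))

# Per-case missing counts, tabulated (analogue of miss_case_table())
table(rowSums(is.na(airquality)))

# Missingness grouped by Month (analogue of group_by() + miss_var_summary())
tapply(is.na(airquality$Ozone), airquality$Month, sum)
```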

How to visualize missing values?

  • Visualizing the missingness of the data can help highlight issues
  • Can get a bird’s-eye view of the data, including spans and groups
    • vis_miss(airquality)
    • vis_miss(airquality, cluster = TRUE)
    • gg_miss_var(airquality)
    • gg_miss_case(airquality)
    • gg_miss_var(airquality, facet = Month)
    • gg_miss_upset(airquality) # co-occurrence of missing data
    • gg_miss_fct(x = airquality, fct = Month) # plots by factor
    • gg_miss_span(pedestrian, hourly_counts, span_every = 3000)

Example code includes:

# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")

# Use any_na() and are_na() on x to explore the missings
naniar::any_na(x)
## [1] TRUE
naniar::are_na(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
dat_hw <- data.frame(weight=c(95.16, NA, 102.82, 80.98, 112.91, 94, 105.43, 77.79, NA, 98.93, 68.26, 94.16, 105.32, 61.4, 72.89, 85.67, NA, 63.3, 98.98, 72.17, NA, 103.63, 87.52, 89.78, 103.03, 97.26, 82.77, 68.27, 92.93, 74.55, 61.55, 86.09, 80.04, 88.78, 76.25, 80.44, 99.37, 84.21, NA, 88.5, 97.34, 95.35, 91.91, 78.76, NA, 101.57, 68.33, 89.75, 90.96, 87.17, 104.96, NA, 72.18, 74.09, NA, 92.65, 79.61, 110.09, 77.67, 87.46, 66.91, 76.59, 84.96, 80.21, NA, 64.15, 55.14, NA, 84.47, 100.97, NA, 83.26, 42.15, 89.25, 92.04, NA, 72.76, 69.67, 80.37, NA, 58.38, 84.34, 62.84, NA, 94.23, 83.48, 75.54, 79.93, 79.66, NA, 97.61, 77.11, 83.92, 104.56, 105.94, 107.15, 45.75, 76.61, 88.29, 93.05), height=c(1.95, 2.35, 1.64, 2.47, 1.92, 1.9, 0.83, 2.7, 1.98, 1.83, 0.24, NA, 1.67, NA, 2.03, 2.78, 0.59, 1.99, 2.34, 1.99, -0.05, 0.36, NA, 0.88, NA, 1.37, 2.62, 0.71, 0.52, -0.12, 2.25, 1.06, 1.99, 0.94, -1.11, 1.23, 1.31, 2, 1.1, 0.55, 1.84, 2.14, NA, NA, 1.94, 0.66, 0.47, 2.37, 3.4, 1.4, 2.52, 0.15, 2.42, 0.47, NA, 1.08, 1.89, 2.92, 2.71, NA, 2.72, NA, NA, 1.76, 0.73, 1.84, -0.09, 3.62, 2.34, 0.61, 2.15, 0.39, 0.92, NA, 1.41, 0, 3.51, NA, 0.18, 1.31, 1.19, 2.81, 3.32, 0.06, 3.44, NA, 1.32, NA, 2.46, 3.09, 0.13, 0.92, 0.16, 0.88, 1.38, 0.28, 2.51, NA, 1.05, 3.16))

# Use n_miss() to count the total number of missing values in dat_hw
naniar::n_miss(dat_hw)
## [1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
naniar::n_miss(dat_hw$weight)
## [1] 15
# Use n_complete() on dat_hw to count the total number of complete values
naniar::n_complete(dat_hw)
## [1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
naniar::n_complete(dat_hw$weight)
## [1] 85
# Use prop_miss() and prop_complete() on dat_hw to count the total number of missing values in each of the variables
naniar::prop_miss(dat_hw)
## [1] 0.15
naniar::prop_complete(dat_hw)
## [1] 0.85
data(airquality)
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Summarise missingness in each variable of the `airquality` dataset
naniar::miss_var_summary(airquality)
## # A tibble: 6 x 3
##   variable n_miss pct_miss
##   <chr>     <int>    <dbl>
## 1 Ozone        37    24.2 
## 2 Solar.R       7     4.58
## 3 Wind          0     0   
## 4 Temp          0     0   
## 5 Month         0     0   
## 6 Day           0     0
# Summarise missingness in each case of the `airquality` dataset
naniar::miss_case_summary(airquality)
## # A tibble: 153 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1     5      2     33.3
##  2    27      2     33.3
##  3     6      1     16.7
##  4    10      1     16.7
##  5    11      1     16.7
##  6    25      1     16.7
##  7    26      1     16.7
##  8    32      1     16.7
##  9    33      1     16.7
## 10    34      1     16.7
## # ... with 143 more rows
# Return the summary of missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% 
    group_by(Month) %>% 
    naniar::miss_var_summary()
## # A tibble: 25 x 4
##    Month variable n_miss pct_miss
##    <int> <chr>     <int>    <dbl>
##  1     5 Ozone         5     16.1
##  2     5 Solar.R       4     12.9
##  3     5 Wind          0      0  
##  4     5 Temp          0      0  
##  5     5 Day           0      0  
##  6     6 Ozone        21     70  
##  7     6 Solar.R       0      0  
##  8     6 Wind          0      0  
##  9     6 Temp          0      0  
## 10     6 Day           0      0  
## # ... with 15 more rows
# Return the summary of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% 
    group_by(Month) %>% 
    naniar::miss_case_summary()
## # A tibble: 153 x 4
##    Month  case n_miss pct_miss
##    <int> <int>  <int>    <dbl>
##  1     5     5      2       40
##  2     5    27      2       40
##  3     5     6      1       20
##  4     5    10      1       20
##  5     5    11      1       20
##  6     5    25      1       20
##  7     5    26      1       20
##  8     5     1      0        0
##  9     5     2      0        0
## 10     5     3      0        0
## # ... with 143 more rows
# Tabulate missingness in each variable and case of the `airquality` dataset
naniar::miss_var_table(airquality)
## # A tibble: 3 x 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      4     66.7
## 2             7      1     16.7
## 3            37      1     16.7
naniar::miss_case_table(airquality)
## # A tibble: 3 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0     111     72.5 
## 2              1      40     26.1 
## 3              2       2      1.31
# Tabulate the missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% 
    group_by(Month) %>% 
    naniar::miss_var_table()
## # A tibble: 12 x 4
##    Month n_miss_in_var n_vars pct_vars
##    <int>         <int>  <int>    <dbl>
##  1     5             0      3       60
##  2     5             4      1       20
##  3     5             5      1       20
##  4     6             0      4       80
##  5     6            21      1       20
##  6     7             0      4       80
##  7     7             5      1       20
##  8     8             0      3       60
##  9     8             3      1       20
## 10     8             5      1       20
## 11     9             0      4       80
## 12     9             1      1       20
# Tabulate the missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% 
    group_by(Month) %>% 
    naniar::miss_case_table()
## # A tibble: 11 x 4
##    Month n_miss_in_case n_cases pct_cases
##    <int>          <int>   <int>     <dbl>
##  1     5              0      24     77.4 
##  2     5              1       5     16.1 
##  3     5              2       2      6.45
##  4     6              0       9     30   
##  5     6              1      21     70   
##  6     7              0      26     83.9 
##  7     7              1       5     16.1 
##  8     8              0      23     74.2 
##  9     8              1       8     25.8 
## 10     9              0      29     96.7 
## 11     9              1       1      3.33
data(pedestrian, package="naniar")
str(pedestrian)
## Classes 'tbl_df', 'tbl' and 'data.frame':    37700 obs. of  9 variables:
##  $ hourly_counts: int  883 597 294 183 118 68 47 52 120 333 ...
##  $ date_time    : POSIXct, format: "2016-01-01 00:00:00" "2016-01-01 01:00:00" ...
##  $ year         : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ month        : Ord.factor w/ 12 levels "January"<"February"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ month_day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ week_day     : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ hour         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ sensor_id    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ sensor_name  : chr  "Bourke Street Mall (South)" "Bourke Street Mall (South)" "Bourke Street Mall (South)" "Bourke Street Mall (South)" ...
library(naniar)


# Needed so that the rle result can be converted to a data.frame inside naniar::miss_var_run()
as.data.frame.rle <- function(x, ...) do.call(data.frame, x)


# Calculate the summaries for each run of missingness for the variable `hourly_counts`
naniar::miss_var_run(pedestrian, var = hourly_counts)
## # A tibble: 35 x 2
##    run_length is_na   
##         <int> <chr>   
##  1       6628 complete
##  2          1 missing 
##  3       5250 complete
##  4        624 missing 
##  5       3652 complete
##  6          1 missing 
##  7       1290 complete
##  8        744 missing 
##  9       7420 complete
## 10          1 missing 
## # ... with 25 more rows
# Calculate the summaries for each span of missingness, for a span of 4000, for the variable `hourly_counts`
naniar::miss_var_span(pedestrian, var = "hourly_counts", span_every = 4000)
## # A tibble: 10 x 5
##    span_counter n_miss n_complete prop_miss prop_complete
##           <int>  <int>      <dbl>     <dbl>         <dbl>
##  1            1      0       4000         0             1
##  2            2      0       4000         0             1
##  3            3      0       4000         0             1
##  4            4      0       4000         0             1
##  5            5      0       4000         0             1
##  6            6      0       4000         0             1
##  7            7      0       4000         0             1
##  8            8      0       4000         0             1
##  9            9      0       4000         0             1
## 10           10      0       4000         0             1
# For each `month` variable, calculate the run of missingness for `hourly_counts`
pedestrian %>% 
    group_by(month) %>% 
    naniar::miss_var_run(var = "hourly_counts")
## # A tibble: 51 x 3
##    month    run_length is_na   
##    <ord>         <int> <chr>   
##  1 January        2976 complete
##  2 February       2784 complete
##  3 March          2976 complete
##  4 April           888 complete
##  5 April           552 missing 
##  6 April          1440 complete
##  7 May             744 complete
##  8 May              72 missing 
##  9 May            2160 complete
## 10 June           2880 complete
## # ... with 41 more rows
# For each `month` variable, calculate the span of missingness of a span of 2000
pedestrian %>% 
    group_by(month) %>% 
    naniar::miss_var_span(var = "hourly_counts", span_every = 2000)
## # A tibble: 25 x 6
##    month    span_counter n_miss n_complete prop_miss prop_complete
##    <ord>           <int>  <int>      <dbl>     <dbl>         <dbl>
##  1 January             1      0       2000         0             1
##  2 January             2      0       2000         0             1
##  3 February            1      0       2000         0             1
##  4 February            2      0       2000         0             1
##  5 March               1      0       2000         0             1
##  6 March               2      0       2000         0             1
##  7 April               1      0       2000         0             1
##  8 April               2      0       2000         0             1
##  9 May                 1      0       2000         0             1
## 10 May                 2      0       2000         0             1
## # ... with 15 more rows
data(riskfactors, package="naniar")
str(riskfactors)
## Classes 'tbl_df', 'tbl' and 'data.frame':    245 obs. of  34 variables:
##  $ state           : Factor w/ 52 levels "1","2","5","6",..: 22 36 52 38 28 15 40 50 14 5 ...
##  $ sex             : Factor w/ 2 levels "Male","Female": 2 2 2 1 2 1 1 2 1 2 ...
##  $ age             : int  49 48 55 42 66 66 37 62 38 42 ...
##  $ weight_lbs      : int  190 170 163 230 135 165 150 170 146 260 ...
##  $ height_inch     : int  64 68 64 74 62 70 68 70 70 73 ...
##  $ bmi             : num  32.7 25.9 28 29.6 24.7 ...
##  $ marital         : Factor w/ 6 levels "Married","Divorced",..: 1 2 1 1 3 1 1 5 1 4 ...
##  $ pregnant        : Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA 2 ...
##  $ children        : int  0 0 0 1 0 0 3 0 2 3 ...
##  $ education       : Factor w/ 6 levels "1","2","3","4",..: 6 5 4 6 5 5 6 6 4 5 ...
##  $ employment      : Factor w/ 7 levels "1","2","3","4",..: 2 1 5 1 1 6 2 6 1 3 ...
##  $ income          : Factor w/ 10 levels "<10k","10-15k",..: 6 6 1 8 7 6 8 1 7 3 ...
##  $ veteran         : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 5 3 5 5 5 5 ...
##  $ hispanic        : Factor w/ 2 levels "Yes","No": 2 2 1 2 2 2 2 2 2 2 ...
##  $ health_general  : Factor w/ 6 levels "Excellent","VeryGood",..: 3 4 4 1 1 1 2 5 5 3 ...
##  $ health_physical : int  3 4 0 0 0 0 0 30 30 0 ...
##  $ health_mental   : int  15 30 0 0 0 0 0 30 30 20 ...
##  $ health_poor     : int  2 3 NA NA NA NA NA 30 14 4 ...
##  $ health_cover    : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 2 1 1 ...
##  $ provide_care    : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 1 2 2 2 ...
##  $ activity_limited: Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 1 NA ...
##  $ drink_any       : Factor w/ 2 levels "Yes","No": 2 2 2 1 2 2 1 2 2 NA ...
##  $ drink_days      : int  NA NA NA 15 NA NA 2 NA NA NA ...
##  $ drink_average   : int  NA NA NA NA NA NA 2 NA NA NA ...
##  $ smoke_100       : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 1 1 1 ...
##  $ smoke_days      : Factor w/ 3 levels "Everyday","Somedays",..: NA NA NA NA 1 NA NA 3 1 3 ...
##  $ smoke_stop      : Factor w/ 2 levels "Yes","No": NA NA NA NA 1 NA NA NA 1 NA ...
##  $ smoke_last      : Factor w/ 6 levels "3","4","5","6",..: NA NA NA NA NA NA NA 5 NA 3 ...
##  $ diet_fruit      : int  1095 52 36 NA -7 24 52 156 24 NA ...
##  $ diet_salad      : int  261 209 156 NA 261 52 156 24 84 NA ...
##  $ diet_potato     : int  104 52 52 NA 209 104 24 52 144 NA ...
##  $ diet_carrot     : int  156 0 24 NA 261 52 24 104 24 NA ...
##  $ diet_vegetable  : int  521 52 24 NA 365 365 730 365 0 NA ...
##  $ diet_juice      : int  12 0 24 NA 104 365 104 0 0 NA ...
# Visualize all of the missingness in the `riskfactors`  dataset
naniar::vis_miss(riskfactors)

# Visualize and cluster all of the missingness in the `riskfactors` dataset
naniar::vis_miss(riskfactors, cluster = TRUE)

# Visualize and sort the columns by missingness in the `riskfactors` dataset
naniar::vis_miss(riskfactors, sort_miss = TRUE)

# Visualize the number of missings in cases using `gg_miss_case()`
naniar::gg_miss_case(riskfactors)

# Explore the number of missings in cases using `gg_miss_case()` and facet by the variable `education`
naniar::gg_miss_case(riskfactors, facet = education)
## Warning: Factor `education` contains implicit NA, consider using
## `forcats::fct_explicit_na`

# Visualize the number of missings in variables using `gg_miss_var()`
naniar::gg_miss_var(riskfactors)

# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
naniar::gg_miss_var(riskfactors, facet = education)
## Warning: Factor `education` contains implicit NA, consider using
## `forcats::fct_explicit_na`

# Using the `airquality` dataset, explore the missingness pattern using `gg_miss_upset()`
naniar::gg_miss_upset(airquality)

# With the `riskfactors` dataset, explore how the missingness changes across the `marital` using `gg_miss_fct()`
naniar::gg_miss_fct(x = riskfactors, fct = marital)
## Warning: Factor `marital` contains implicit NA, consider using
## `forcats::fct_explicit_na`

# Using the `pedestrian` dataset, explore how the missingness changes over a span of 3000
naniar::gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)

# Using the `pedestrian` dataset: Explore the impact of `month` by facetting by `month`
# and explore how missingness changes for a span of 1000
naniar::gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)


Chapter 2 - Wrangling and Tidying Missing Values

Search for and replace missing values:

  • May find that not all missing values are coded as NA - could be "N/A" or "missing" or the like
  • Can use the "chaos" dataset
    • chaos %>% miss_scan_count(search = list("N/A"))
    • chaos %>% miss_scan_count(search = list("N/A", "N/a"))
  • Can replace the other specifications of NA
    • chaos %>% replace_with_na(replace = list(grade = c("N/A", "N/a")))
  • Can use the "scoped variants" features similar to dplyr
    • chaos %>% replace_with_na_all(condition = ~.x == -99) # ~ introduces a function, and .x refers to each variable
    • chaos %>% replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
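
The replace-with-NA idea can be sketched in base R with a toy column; the sentinel codes here mirror the ones above:

```r
# Toy column with several ad-hoc missing-value codes (illustrative only)
grade <- c("A", "N/A", "B", "missing", "-99", "C")

# Recode every sentinel as a real NA, like replace_with_na_all()
sentinels <- c("N/A", "N/a", "missing", "na", "-99")
grade[grade %in% sentinels] <- NA

grade              # "A" NA "B" NA NA "C"
sum(is.na(grade))  # 3
```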

Filling down missing values:

  • May want to manage implied missing values - “missing missing values”
  • Can spread the data to make missing values more obvious; may help to “un-tidy” the data
    • tetris %>% tidyr::complete(name, time) # will add the missing combinations, keeping the data in tidy format
  • May want to fill a value down, for example when name is added only the first time the record is shown
    • tetris %>% tidyr::fill(name) # default LOCF
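
The fill-down (LOCF) behavior of tidyr::fill() can be sketched with a small base-R loop:

```r
# Toy column where each name is recorded only on its first row (illustrative)
name <- c("Robin", NA, NA, "Sam", NA)

# Last observation carried forward, the default direction of tidyr::fill()
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

locf(name)  # "Robin" "Robin" "Robin" "Sam" "Sam"
```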

Missing data dependence:

  • Missing data dependence has theory associated with it
    • MCAR - missing completely at random
    • MAR - missing at random
    • MNAR - missing not at random
  • MCAR is where the missingness has no association with any observed or unobserved data (e.g., values missing because workers were on vacation)
    • Implications are that imputing is OK; deleting might be OK, but may lose too much data (delete only if it costs roughly 5% of the data or less)
  • MAR is where missingness depends on observed data but not on unobserved data (e.g., missingness more likely among workers with depression)
    • Do not delete, and be careful with imputing
  • MNAR is where missingness is related to the unobserved value itself; both deleting and imputing can bias results
  • Can use visualizations to see the potential structures of various types of missing data
    • vis_miss(mtcars, cluster = TRUE) # noisy patterns suggest MCAR
    • oceanbuoys %>% arrange(year) %>% vis_miss() # clustering of missingness suggests MAR
    • vis_miss(ocean, cluster = TRUE) # blocks of data may suggest MNAR, but can be challenging to find
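
A small simulation can make the MCAR/MAR distinction concrete; the mechanism and the numbers here are invented for illustration:

```r
set.seed(42)
n <- 5000
x <- rnorm(n)        # fully observed covariate
y <- x + rnorm(n)    # variable that will receive missing values

# MCAR: each y has the same 20% chance of going missing
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: missingness in y depends on the *observed* x (more likely for large x)
y_mar <- ifelse(runif(n) < plogis(2 * x), NA, y)

# Under MCAR the observed mean stays roughly unbiased; under MAR it is
# biased low, because large-x (and hence large-y) cases tend to vanish
mean(y, na.rm = TRUE)
mean(y_mcar, na.rm = TRUE)
mean(y_mar, na.rm = TRUE)
```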

Example code includes:

pacman <- tibble(year=c('2004', '1991', 'na', '1992', '1988', '2007', '2016', '2011', '2018', '2012', '1983', '1988', '1981', '1990', '1989', '1995', 'missing', 'missing', '2003', '2000', '2012', '2008', '2007', '1987', '2009', '1987', '2016', '2011', '2008', '1984', '2003', '1988', '2001', '1990', '2018', '1985', '2010', '1986', '1980', '1982', '2009', '1998', '1991', '1987', '1982', '1998', '2004', '2007', '2000', '2014', '1980', '1983', '2011', '2003', '2013', '2018', '2006', '2005', '1994', '2009', '2004', '1991', 'na', '2004', '1993', '1989', '2004', '2011', '1990', '1985', '2017', '1992', '1999', '2014', '1996', '2007', '2008', '1998', '1996', '1998', '2017', '1998', '2016', '1983', '2009', 'missing', '1993', '1989', '1994', '1980', '1983', '2004', 'missing', '1997', '1994', '2008', 'missing', '2007', '2016', '1992', '2000', '2002', '2004', '2007', '2013', '1983', '2005', '1999', '1990', '1998', '1982', '2002', 'na', '1998', '2006', '2004', '2012', '1981', '2000', '2014', '1999', '1997', '2003', '1993', '1982', '1992', '2008', '1985', '2016', '1990', '1991', '1980', '2000', 'na', '2018', 'na', '2014', '1988', 'missing', '2002', '2012', '2017', '1987', '1998', '1999', '1985', '1989', '2017', '1982', '1994', '2003', 'na', '2011', 'missing', 'missing', '1986', '2007', '2006', 'missing', '2010', '1982', '2008', '1983', '2018', '1987', '1983', 'missing', 'missing', '1998', '1988', '2010', '1981', 'na', '2016', 'na', '1992', '2001', '1995', '1999', '2009', 'na', 'na', '2003', '2017', 'na', '1982', '2005', '2013', '1990', '2004', '2004', '2006', '2009', '1984', '2007', '1987', 'na', '2001', '1983', '2012'), 
                 month=c('6', '11', 'na', '11', '9', '12', '9', '1', '4', '9', '4', '11', '6', '7', '10', '8', 'missing', 'missing', '5', '5', '8', '1', '10', '11', '6', '7', '10', '8', '7', '1', '9', '10', '11', '7', '1', '5', '10', '6', '8', '11', '11', '8', '10', '8', '1', '9', '9', '7', '11', '11', '10', '7', '12', '9', '12', '8', '11', '4', '11', '1', '1', '9', 'na', '7', '10', '10', '3', '3', '9', '5', '8', '1', '5', '12', '6', '3', '7', '9', '12', '2', '5', '8', '4', '6', '1', 'missing', '8', '1', '12', '2', '5', '8', 'missing', '7', '2', '7', 'missing', '4', '11', '6', '5', '11', '12', '3', '3', '5', '1', '6', '12', '1', '11', '7', 'na', '9', '11', '7', '9', '10', '10', '11', '5', '11', '6', '6', '4', '10', '10', '1', '4', '6', '8', '12', '11', 'na', '11', 'na', '10', '9', 'missing', '5', '1', '5', '4', '3', '2', '11', '2', '9', '3', '3', '5', 'na', '12', 'missing', 'missing', '4', '1', '2', 'missing', '7', '2', '1', '4', '9', '4', '3', 'missing', 'missing', '1', '12', '2', '4', 'na', '3', 'na', '11', '7', '2', '5', '1', 'na', 'na', '1', '4', 'na', '10', '5', '4', '8', '2', '9', '11', '7', '10', '2', '9', 'na', '4', '6', '10'), 
                 day=c('1', '22', 'na', '16', '16', '4', '5', '25', '14', '25', '1', '8', '17', '14', '15', '18', 'missing', 'missing', '21', '18', '21', '2', '15', '18', '22', '26', '25', '3', '24', '6', '18', '6', '4', '1', '23', '3', '10', '23', '11', '4', '11', '17', '23', '10', '8', '19', '10', '6', '24', '9', '25', '18', '7', '25', '24', '23', '17', '8', '10', '17', '8', '8', 'na', '24', '25', '7', '6', '19', '10', '13', '24', '13', '26', '4', '5', '21', '28', '15', '22', '10', '11', '15', '20', '23', '6', 'missing', '3', '2', '24', '11', '21', '21', 'missing', '24', '13', '6', 'missing', '14', '13', '17', '11', '18', '24', '9', '6', '1', '11', '21', '3', '12', '23', '27', 'na', '1', '13', '7', '17', '11', '13', '11', '20', '7', '2', '10', '9', '24', '21', '12', '25', '17', '14', '24', '18', 'na', '7', 'na', '22', '8', 'missing', '12', '14', '15', '21', '21', '1', '11', '18', '9', '18', '15', '1', 'na', '9', 'missing', 'missing', '28', '13', '12', 'missing', '3', '16', '5', '3', '19', '14', '7', 'missing', 'missing', '4', '25', '15', '4', 'na', '21', 'na', '5', '11', '10', '11', '15', 'na', 'na', '2', '10', 'na', '4', '16', '12', '11', '19', '5', '23', '24', '22', '27', '11', 'na', '23', '21', '27'), 
                 initial=c('XGB', 'VGP', 'UAW', 'MXL', 'ZPM', 'ESF', 'YKM', 'ABS', 'NDT', 'GAS', 'IFA', 'OUH', 'PZB', 'EKR', 'TXO', 'NCV', 'XSL', 'ATM', 'LEN', 'QNE', 'CBV', 'DLU', 'LTW', 'TCV', 'BVC', 'GSP', 'LVJ', 'YQD', 'HSX', 'KNX', 'PYK', 'PVD', 'OAB', 'GHB', 'LCI', 'HMU', 'VRQ', 'WAJ', 'AIK', 'YPJ', 'BMO', 'YEH', 'YHK', 'YIA', 'TDA', 'XYF', 'LMH', 'JTO', 'ZFD', 'SXE', 'QYC', 'MPI', 'TSI', 'IVR', 'ILM', 'CME', 'FVU', 'HFJ', 'DEF', 'TCX', 'BGA', 'PBK', 'TIB', 'FYX', 'OJA', 'GEH', 'LJB', 'IHF', 'NMS', 'WSC', 'WTO', 'JBV', 'JQI', 'TCP', 'MLU', 'NBM', 'QMY', 'DLV', 'UHP', 'BGE', 'WCR', 'DNC', 'KZS', 'DBM', 'IUC', 'LRG', 'ONT', 'VKF', 'GFU', 'EQI', 'CUR', 'SAZ', 'CFU', 'SOH', 'QTM', 'CZV', 'QNR', 'LMG', 'SGR', 'DXC', 'BKI', 'CMP', 'VDR', 'CIA', 'QYW', 'CJR', 'HJQ', 'NTE', 'EGA', 'ZUY', 'AMT', 'LKP', 'HFW', 'PZQ', 'PJI', 'QJB', 'LAU', 'XYO', 'OJV', 'OBZ', 'QPV', 'LAH', 'UHW', 'XIT', 'UMB', 'OPM', 'GSC', 'PFU', 'OEC', 'ERU', 'ZWA', 'CJA', 'IGE', 'ZBQ', 'XVO', 'BWF', 'VAW', 'WDQ', 'JWT', 'QCT', 'JAH', 'WAQ', 'RCS', 'JPL', 'KCF', 'NXE', 'OPW', 'WYP', 'RMS', 'LND', 'YVO', 'XIR', 'AUW', 'OLA', 'ORF', 'ZAU', 'FXE', 'ACE', 'FQW', 'BND', 'SKA', 'BZX', 'JKY', 'IOZ', 'IYG', 'YZK', 'FOU', 'ZJT', 'XLA', 'TEZ', 'YKB', 'CYS', 'UBJ', 'DKO', 'EWZ', 'PBU', 'GEU', 'LVW', 'YWO', 'WBH', 'GXH', 'NPY', 'UIW', 'EXP', 'QAX', 'RCH', 'ZFM', 'SML', 'FNC', 'HQI', 'NQO', 'QLM', 'EGI', 'CIQ', 'ORU', 'AGP', 'MPY', 'EFL', 'VXR', 'QYE'), 
                 score=c('892369', '2412494', '1874449', '1583331', '3159043', '2755582', '804088', '2392395', '431430', '1482088', '3099396', '810873', '2410285', '1602619', 'N/A', '1547264', '1086746', '885575', '2464437', '333868', '2991881', '1207552', '332352', '115716', 'N/A', '2551711', '679715', '3033343', '275723', '1677698', '1031285', '3251416', '1812998', '1767317', '2457197', '2194699', '1258734', '535437', '3202731', '899729', '1099688', '2125942', '2407498', '1785754', '2181741', '1058088', '1630900', '1629161', '2378243', '3211114', '65436', '2006229', '2068916', '1653110', '2589346', '1520554', '374610', 'N/A', '2841676', '1001739', '438268', '2476918', '2584965', '702929', '189630', 'N/A', '410549', '1269273', '2658430', '1760979', 'N/A', '2705304', '1560004', '826721', '3291811', '2366950', '832279', '426785', '2898752', '1369821', '2712315', '2123280', '2513951', '1004901', '645429', '846193', '313628', '1791507', '2612127', '836682', '1955459', '1866444', '75834', '532534', '3267355', '235734', '2279669', '2976729', '2297788', '1166581', '15715', '890432', '1670356', '1463904', '2867923', '1761345', '2667484', '2357424', '3053758', '2077402', '1052647', '1661650', '123930', '3171836', '1910536', '2100782', '679137', '1424599', '2194459', '1263044', '1948854', 'N/A', '3092624', '2077243', '1010777', '3289300', '3172553', '891045', '1592747', '728752', 'N/A', '24667', 'N/A', '827488', '1643701', '2844488', '539713', '3160321', '762261', '2505569', '271322', '1479487', '1217212', '2960042', '1825455', '1287888', '2105751', '450550', '894755', '3115431', '781721', '3220718', '767717', '3204211', '1666549', '3128098', '2445271', '1571440', '2088915', '645360', '2321491', '1135310', '1736847', '2378391', '3097570', '1220994', '165122', '2007635', '876910', '1551229', '1357429', '2168680', '1411345', '3290465', '1860365', '3181429', '2872190', '2780599', '2160057', '60716', '2222480', '22113', '2815280', 'N/A', '2517561', '500742', '3077608', 
'1481553', '1349499', '2539062', '2057675', '2869686', '863857', '2609949', '2337505', '76444', '3062706', '3031438', '759570', '1741154'), 
                 country=c(' ', 'US', ' ', ' ', ' ', 'US', 'NZ', 'CA', 'GB', 'CN', 'ES', 'US', 'NZ', 'AU', 'CN', 'US', 'CA', 'US', 'US', 'CN', 'AU', 'ES', 'NZ', 'CA', 'CN', 'ES', 'NZ', 'NZ', 'CN', 'GB', 'CN', 'US', 'ES', 'CN', 'US', 'CN', 'AU', 'GB', 'ES', 'AT', ' ', 'US', 'NZ', 'AU', ' ', 'US', 'US', 'US', 'ES', 'NZ', 'AT', 'NZ', 'JP', 'ES', 'NZ', 'NZ', 'GB', 'CN', 'AU', 'GB', 'ES', 'GB', 'AT', 'NZ', 'CN', 'US', 'AU', 'GB', 'US', 'JP', 'CA', 'AT', 'AT', 'CN', 'AU', 'JP', 'CA', 'GB', 'AT', 'AU', 'GB', 'CN', 'AU', 'GB', 'AT', 'NZ', 'JP', 'GB', ' ', 'CN', 'US', 'JP', 'CN', 'GB', 'GB', 'GB', 'AT', 'US', 'GB', 'GB', 'JP', 'CN', 'AU', 'AU', 'AT', 'JP', 'US', 'JP', 'NZ', 'JP', 'AT', 'NZ', 'CA', 'CA', 'GB', 'ES', 'ES', 'GB', 'ES', 'GB', 'AU', 'GB', 'AT', 'CN', 'AT', 'ES', ' ', 'CA', 'CA', 'GB', 'AU', 'CN', 'ES', 'NZ', 'CA', 'JP', 'JP', 'NZ', 'GB', 'CA', 'ES', 'AT', 'AU', 'CA', 'CN', 'US', 'JP', 'AT', 'CA', 'JP', ' ', 'GB', 'GB', 'NZ', 'AU', 'JP', 'US', 'US', 'AU', 'US', 'AT', 'GB', 'GB', 'GB', 'AT', 'CN', 'ES', 'US', 'JP', 'GB', 'AT', 'JP', 'AU', 'NZ', 'GB', 'GB', 'ES', 'ES', 'AT', 'GB', 'CN', ' ', 'US', 'JP', 'AT', 'US', 'CA', 'AT', 'US', 'GB', 'US', 'ES', 'US', ' ', 'ES', 'JP', 'CA', 'AU', 'CA', 'US')                 )
pacman
## # A tibble: 200 x 6
##    year  month day   initial score   country
##    <chr> <chr> <chr> <chr>   <chr>   <chr>  
##  1 2004  6     1     XGB     892369  " "    
##  2 1991  11    22    VGP     2412494 US     
##  3 na    na    na    UAW     1874449 " "    
##  4 1992  11    16    MXL     1583331 " "    
##  5 1988  9     16    ZPM     3159043 " "    
##  6 2007  12    4     ESF     2755582 US     
##  7 2016  9     5     YKM     804088  NZ     
##  8 2011  1     25    ABS     2392395 CA     
##  9 2018  4     14    NDT     431430  GB     
## 10 2012  9     25    GAS     1482088 CN     
## # ... with 190 more rows
# Explore all of the strange missing values, "N/A", "missing", " ", "na"
naniar::miss_scan_count(data = pacman, search = list("N/A", "missing", " ", "na"))
## # A tibble: 6 x 2
##   Variable     n
##   <chr>    <int>
## 1 year        23
## 2 month       23
## 3 day         23
## 4 initial      0
## 5 score        9
## 6 country     11
# Print the top of the pacman data using `head()`
head(pacman)
## # A tibble: 6 x 6
##   year  month day   initial score   country
##   <chr> <chr> <chr> <chr>   <chr>   <chr>  
## 1 2004  6     1     XGB     892369  " "    
## 2 1991  11    22    VGP     2412494 US     
## 3 na    na    na    UAW     1874449 " "    
## 4 1992  11    16    MXL     1583331 " "    
## 5 1988  9     16    ZPM     3159043 " "    
## 6 2007  12    4     ESF     2755582 US
# Replace the strange missing values "N/A" and "missing" with `NA`
pacman_clean <- naniar::replace_with_na(pacman, replace = list(year = c("N/A", "na", "missing"),
                                                               score = c("N/A", "na", "missing")
                                                               )
                                        )
                                        
# Test whether `pacman_clean` still contains these values
naniar::miss_scan_count(pacman_clean, search = list("N/A", "na", "missing"))
## # A tibble: 6 x 2
##   Variable     n
##   <chr>    <int>
## 1 year         0
## 2 month       23
## 3 day         23
## 4 initial      0
## 5 score        0
## 6 country      0
# Use `replace_with_na_at()` to replace with NA
naniar::replace_with_na_at(pacman, .vars = c("year", "month", "day"),  
                           ~.x %in% c("N/A", "missing", "na", " ")
                           )
## # A tibble: 200 x 6
##    year  month day   initial score   country
##    <chr> <chr> <chr> <chr>   <chr>   <chr>  
##  1 2004  6     1     XGB     892369  " "    
##  2 1991  11    22    VGP     2412494 US     
##  3 <NA>  <NA>  <NA>  UAW     1874449 " "    
##  4 1992  11    16    MXL     1583331 " "    
##  5 1988  9     16    ZPM     3159043 " "    
##  6 2007  12    4     ESF     2755582 US     
##  7 2016  9     5     YKM     804088  NZ     
##  8 2011  1     25    ABS     2392395 CA     
##  9 2018  4     14    NDT     431430  GB     
## 10 2012  9     25    GAS     1482088 CN     
## # ... with 190 more rows
# Use `replace_with_na_if()` to replace with NA
naniar::replace_with_na_if(pacman, .predicate = is.character, 
                           ~.x %in% c("N/A", "missing", "na")
                           )
## # A tibble: 200 x 6
##    year  month day   initial score   country
##    <chr> <chr> <chr> <chr>   <chr>   <chr>  
##  1 2004  6     1     XGB     892369  " "    
##  2 1991  11    22    VGP     2412494 US     
##  3 <NA>  <NA>  <NA>  UAW     1874449 " "    
##  4 1992  11    16    MXL     1583331 " "    
##  5 1988  9     16    ZPM     3159043 " "    
##  6 2007  12    4     ESF     2755582 US     
##  7 2016  9     5     YKM     804088  NZ     
##  8 2011  1     25    ABS     2392395 CA     
##  9 2018  4     14    NDT     431430  GB     
## 10 2012  9     25    GAS     1482088 CN     
## # ... with 190 more rows
# Use `replace_with_na_all()` to replace with NA
naniar::replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na"))
## # A tibble: 200 x 6
##    year  month day   initial score   country
##    <chr> <chr> <chr> <chr>   <chr>   <chr>  
##  1 2004  6     1     XGB     892369  " "    
##  2 1991  11    22    VGP     2412494 US     
##  3 <NA>  <NA>  <NA>  UAW     1874449 " "    
##  4 1992  11    16    MXL     1583331 " "    
##  5 1988  9     16    ZPM     3159043 " "    
##  6 2007  12    4     ESF     2755582 US     
##  7 2016  9     5     YKM     804088  NZ     
##  8 2011  1     25    ABS     2392395 CA     
##  9 2018  4     14    NDT     431430  GB     
## 10 2012  9     25    GAS     1482088 CN     
## # ... with 190 more rows
frogger <- tibble(name=factor(c('jesse', 'jesse', 'jesse', 'jesse', 'andy', 'andy', 'andy', 'nic', 'nic', 'dan', 'dan', 'alex', 'alex', 'alex', 'alex')), 
                  time=factor(c('morning', 'afternoon', 'evening', 'late_night', 'morning', 'afternoon', 'late_night', 'afternoon', 'late_night', 'morning', 'evening', 'morning', 'afternoon', 'evening', 'late_night')), 
                  value=as.integer(c(6678, 800060, 475528, 143533, 425115, 587468, 111000, 588532, 915533, 388148, 180912, 552670, 98355, 266055, 121056))
                  )
str(frogger)
## Classes 'tbl_df', 'tbl' and 'data.frame':    15 obs. of  3 variables:
##  $ name : Factor w/ 5 levels "alex","andy",..: 4 4 4 4 2 2 2 5 5 3 ...
##  $ time : Factor w/ 4 levels "afternoon","evening",..: 4 1 2 3 4 1 3 1 3 4 ...
##  $ value: int  6678 800060 475528 143533 425115 587468 111000 588532 915533 388148 ...
# Use `complete()` on the `name` and `time` variables to make implicit missing values explicit
frogger
## # A tibble: 15 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 jesse morning      6678
##  2 jesse afternoon  800060
##  3 jesse evening    475528
##  4 jesse late_night 143533
##  5 andy  morning    425115
##  6 andy  afternoon  587468
##  7 andy  late_night 111000
##  8 nic   afternoon  588532
##  9 nic   late_night 915533
## 10 dan   morning    388148
## 11 dan   evening    180912
## 12 alex  morning    552670
## 13 alex  afternoon   98355
## 14 alex  evening    266055
## 15 alex  late_night 121056
frogger_tidy <- frogger %>% 
    complete(name, time)
frogger_tidy
## # A tibble: 20 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 alex  afternoon   98355
##  2 alex  evening    266055
##  3 alex  late_night 121056
##  4 alex  morning    552670
##  5 andy  afternoon  587468
##  6 andy  evening        NA
##  7 andy  late_night 111000
##  8 andy  morning    425115
##  9 dan   afternoon      NA
## 10 dan   evening    180912
## 11 dan   late_night     NA
## 12 dan   morning    388148
## 13 jesse afternoon  800060
## 14 jesse evening    475528
## 15 jesse late_night 143533
## 16 jesse morning      6678
## 17 nic   afternoon  588532
## 18 nic   evening        NA
## 19 nic   late_night 915533
## 20 nic   morning        NA
# Use `fill()` to fill down the name variable in the frogger dataset
frogger
## # A tibble: 15 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 jesse morning      6678
##  2 jesse afternoon  800060
##  3 jesse evening    475528
##  4 jesse late_night 143533
##  5 andy  morning    425115
##  6 andy  afternoon  587468
##  7 andy  late_night 111000
##  8 nic   afternoon  588532
##  9 nic   late_night 915533
## 10 dan   morning    388148
## 11 dan   evening    180912
## 12 alex  morning    552670
## 13 alex  afternoon   98355
## 14 alex  evening    266055
## 15 alex  late_night 121056
frogger %>% 
    fill(name)
## # A tibble: 15 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 jesse morning      6678
##  2 jesse afternoon  800060
##  3 jesse evening    475528
##  4 jesse late_night 143533
##  5 andy  morning    425115
##  6 andy  afternoon  587468
##  7 andy  late_night 111000
##  8 nic   afternoon  588532
##  9 nic   late_night 915533
## 10 dan   morning    388148
## 11 dan   evening    180912
## 12 alex  morning    552670
## 13 alex  afternoon   98355
## 14 alex  evening    266055
## 15 alex  late_night 121056
# Correctly fill() and complete() missing values so that our dataset becomes sensible
frogger
## # A tibble: 15 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 jesse morning      6678
##  2 jesse afternoon  800060
##  3 jesse evening    475528
##  4 jesse late_night 143533
##  5 andy  morning    425115
##  6 andy  afternoon  587468
##  7 andy  late_night 111000
##  8 nic   afternoon  588532
##  9 nic   late_night 915533
## 10 dan   morning    388148
## 11 dan   evening    180912
## 12 alex  morning    552670
## 13 alex  afternoon   98355
## 14 alex  evening    266055
## 15 alex  late_night 121056
frogger %>% 
    fill(name) %>%
    complete(name, time)
## # A tibble: 20 x 3
##    name  time        value
##    <fct> <fct>       <int>
##  1 alex  afternoon   98355
##  2 alex  evening    266055
##  3 alex  late_night 121056
##  4 alex  morning    552670
##  5 andy  afternoon  587468
##  6 andy  evening        NA
##  7 andy  late_night 111000
##  8 andy  morning    425115
##  9 dan   afternoon      NA
## 10 dan   evening    180912
## 11 dan   late_night     NA
## 12 dan   morning    388148
## 13 jesse afternoon  800060
## 14 jesse evening    475528
## 15 jesse late_night 143533
## 16 jesse morning      6678
## 17 nic   afternoon  588532
## 18 nic   evening        NA
## 19 nic   late_night 915533
## 20 nic   morning        NA
data("oceanbuoys", package="naniar")
str(oceanbuoys)
## Classes 'tbl_df', 'tbl' and 'data.frame':    736 obs. of  8 variables:
##  $ year      : num  1997 1997 1997 1997 1997 ...
##  $ latitude  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ longitude : num  -110 -110 -110 -110 -110 -110 -110 -110 -110 -110 ...
##  $ sea_temp_c: num  27.6 27.5 27.6 27.6 27.6 ...
##  $ air_temp_c: num  27.1 27 27 26.9 26.8 ...
##  $ humidity  : num  79.6 75.8 76.5 76.2 76.4 ...
##  $ wind_ew   : num  -6.4 -5.3 -5.1 -4.9 -3.5 ...
##  $ wind_ns   : num  5.4 5.3 4.5 2.5 4.1 ...
# Arrange by year
oceanbuoys %>% 
    arrange(year) %>% 
    naniar::vis_miss()

# Arrange by latitude
oceanbuoys %>% 
    arrange(latitude) %>% 
    naniar::vis_miss()

# Arrange by wind_ew (wind east west)
oceanbuoys %>% 
    arrange(wind_ew) %>% 
    naniar::vis_miss()


Chapter 3 - Testing Missing Relationships

Tools to explore missing data dependence:

  • Example of census data containing income and education
    • Faceting by education will show differences in income by education
  • Can consider a shadow matrix where 1 means missing and 0 means present
    • The shadow matrix variables have the same names as the regular data, with _NA as a suffix
    • Each shadow cell takes an explicit value: NA (missing) or !NA (not missing)
    • Can also bind the shadow matrix to the original data (“nabular” data)
    • bind_shadow(airquality)
    • airquality %>% bind_shadow() %>% group_by(Ozone_NA) %>% summarise(mean = mean(Wind))
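
The shadow matrix idea can be reproduced in a few lines of base R. This is a simplified sketch with a hypothetical helper (`make_shadow`), not naniar's actual implementation — `as_shadow()` returns factor columns, but the structure is the same:

```r
# Build a shadow matrix by hand: one "_NA" column per variable,
# holding "NA" where the value is missing and "!NA" where it is present.
make_shadow <- function(df) {
    shadow <- as.data.frame(lapply(df, function(x) ifelse(is.na(x), "NA", "!NA")),
                            stringsAsFactors = FALSE)
    names(shadow) <- paste0(names(shadow), "_NA")
    shadow
}

aq_shadow <- make_shadow(airquality)
head(aq_shadow$Ozone_NA)   # "!NA" where Ozone is observed, "NA" where it is missing
```

Binding these columns back onto the original data (what `bind_shadow()` does) is what lets later `group_by(Ozone_NA)` summaries compare observed and missing cases.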

Visualizing missingness across one variable:

  • Can explore conditional plots based on the missingness status of the data
    • ggplot(airquality, aes(x = Temp)) + geom_density()
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Temp, color = Ozone_NA)) + geom_density()
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Ozone_NA, y = Temp)) + geom_boxplot()
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Temp)) + geom_density() + facet_wrap(~Ozone_NA)
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Temp, y = Wind)) + geom_point() + facet_wrap(~Ozone_NA)
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Temp, y = Wind, color = Ozone_NA)) + geom_point()
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Temp, color = Ozone_NA)) + geom_density() + facet_wrap(~Solar.R_NA)

Visualizing missingness across two variables:

  • Missing values are frequently ignored in a scatterplot, but it may be helpful to still consider them
    • ggplot(airquality, aes(x = Ozone, y = Solar.R)) + geom_miss_point() # imputes to 10% below the minimum for the missing dimension
    • ggplot(airquality, aes(x = Wind, y = Ozone)) + geom_miss_point() + facet_wrap(~Month)
    • airquality %>% bind_shadow() %>% ggplot(aes(x = Wind, y = Ozone)) + geom_miss_point() + facet_wrap(~Solar.R_NA)

Example code includes:

# Create shadow matrix data with `as_shadow()`
naniar::as_shadow(oceanbuoys)
## # A tibble: 736 x 8
##    year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
##    <fct>   <fct>       <fct>        <fct>         <fct>         <fct>      
##  1 !NA     !NA         !NA          !NA           !NA           !NA        
##  2 !NA     !NA         !NA          !NA           !NA           !NA        
##  3 !NA     !NA         !NA          !NA           !NA           !NA        
##  4 !NA     !NA         !NA          !NA           !NA           !NA        
##  5 !NA     !NA         !NA          !NA           !NA           !NA        
##  6 !NA     !NA         !NA          !NA           !NA           !NA        
##  7 !NA     !NA         !NA          !NA           !NA           !NA        
##  8 !NA     !NA         !NA          !NA           !NA           !NA        
##  9 !NA     !NA         !NA          !NA           !NA           !NA        
## 10 !NA     !NA         !NA          !NA           !NA           !NA        
## # ... with 726 more rows, and 2 more variables: wind_ew_NA <fct>,
## #   wind_ns_NA <fct>
# Create nabular data by binding the shadow to the data with `bind_shadow()`
naniar::bind_shadow(oceanbuoys)
## # A tibble: 736 x 16
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 8 more variables: year_NA <fct>,
## #   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>,
## #   wind_ns_NA <fct>
# Bind only the variables with missing values by using bind_shadow(only_miss = TRUE)
naniar::bind_shadow(oceanbuoys, only_miss = TRUE)
## # A tibble: 736 x 11
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 3 more variables: sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>
# `bind_shadow()` and `group_by()` humidity missingness (`humidity_NA`)
oceanbuoys %>%
    naniar::bind_shadow() %>%
    group_by(humidity_NA) %>% 
    summarise(wind_ew_mean = mean(wind_ew), 
              wind_ew_sd = sd(wind_ew)
              ) 
## # A tibble: 2 x 3
##   humidity_NA wind_ew_mean wind_ew_sd
##   <fct>              <dbl>      <dbl>
## 1 !NA                -3.78       1.90
## 2 NA                 -3.30       2.31
# Repeat this, but calculating summaries for wind north south (`wind_ns`).
oceanbuoys %>%
    naniar::bind_shadow() %>%
    group_by(humidity_NA) %>% 
    summarise(wind_ns_mean = mean(wind_ns), 
              wind_ns_sd = sd(wind_ns)
              )
## # A tibble: 2 x 3
##   humidity_NA wind_ns_mean wind_ns_sd
##   <fct>              <dbl>      <dbl>
## 1 !NA                 2.78       2.06
## 2 NA                  1.66       2.23
# Summarise wind_ew by the missingness of `air_temp_c_NA`
oceanbuoys %>% 
    naniar::bind_shadow() %>%
    group_by(air_temp_c_NA) %>%
    summarise(wind_ew_mean = mean(wind_ew),
              wind_ew_sd = sd(wind_ew),
              n_obs = n()
              )
## # A tibble: 2 x 4
##   air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
##   <fct>                <dbl>      <dbl> <int>
## 1 !NA                  -3.91       1.85   655
## 2 NA                   -2.17       2.14    81
# Summarise wind_ew by missingness of `air_temp_c_NA` and `humidity_NA`
oceanbuoys %>% 
    naniar::bind_shadow() %>%
    group_by(air_temp_c_NA, humidity_NA) %>%
    summarise(wind_ew_mean = mean(wind_ew),
              wind_ew_sd = sd(wind_ew),
              n_obs = n()
              )
## # A tibble: 4 x 5
## # Groups:   air_temp_c_NA [2]
##   air_temp_c_NA humidity_NA wind_ew_mean wind_ew_sd n_obs
##   <fct>         <fct>              <dbl>      <dbl> <int>
## 1 !NA           !NA                -4.01       1.74   565
## 2 !NA           NA                 -3.24       2.31    90
## 3 NA            !NA                -2.06       2.08    78
## 4 NA            NA                 -4.97       1.74     3
# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
naniar::vis_miss(oceanbuoys)

# Explore the distribution of `wind_ew` for the missingness of `air_temp_c_NA` using  `geom_density()`
naniar::bind_shadow(oceanbuoys) %>%
    ggplot(aes(x = wind_ew, color = air_temp_c_NA)) + 
    geom_density()

# Explore the distribution of sea temperature for the missingness of humidity (humidity_NA) using  `geom_density()`
naniar::bind_shadow(oceanbuoys) %>%
    ggplot(aes(x = sea_temp_c, color = humidity_NA)) + 
    geom_density()
## Warning: Removed 3 rows containing non-finite values (stat_density).

# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using  `geom_density()` and facetting by the missingness of air temperature (`air_temp_c_NA`).
oceanbuoys %>%
    naniar::bind_shadow() %>%
    ggplot(aes(x = wind_ew)) + 
    geom_density() + 
    facet_wrap(~air_temp_c_NA)

# Build upon this visualisation by coloring by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
    naniar::bind_shadow() %>%
    ggplot(aes(x = wind_ew, color = humidity_NA)) + 
    geom_density() + 
    facet_wrap(~air_temp_c_NA)

# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using  `geom_boxplot()`
oceanbuoys %>%
    naniar::bind_shadow() %>%
    ggplot(aes(x = air_temp_c_NA, y = wind_ew)) + 
    geom_boxplot()

# Build upon this visualisation by facetting by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
    naniar::bind_shadow() %>%
    ggplot(aes(x = air_temp_c_NA, y = wind_ew)) + 
    geom_boxplot() + 
    facet_wrap(~humidity_NA)

# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys, aes(x = wind_ew, y = air_temp_c)) + 
    naniar::geom_miss_point()

# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys, aes(x = humidity, y = air_temp_c)) + 
    naniar::geom_miss_point()

# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys, aes(x = wind_ew, y = air_temp_c)) + 
    naniar::geom_miss_point() + 
    facet_wrap(~year)

# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()` Facet by year to explore this further.
ggplot(oceanbuoys, aes(x=humidity, y=air_temp_c)) + 
    naniar::geom_miss_point() + 
    facet_wrap(~year)

# Use geom_miss_point() and facet_wrap to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity
naniar::bind_shadow(oceanbuoys) %>%
    ggplot(aes(x = wind_ew, y = air_temp_c)) + 
    naniar::geom_miss_point() + 
    facet_wrap(~humidity_NA)

# Use geom_miss_point() and facet_grid to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity AND by year - by using `facet_grid(humidity_NA ~ year)`
naniar::bind_shadow(oceanbuoys) %>%
    ggplot(aes(x = wind_ew, y = air_temp_c)) + 
    naniar::geom_miss_point() + 
    facet_grid(humidity_NA~year)


Chapter 4 - Imputation

Filling in the blanks:

  • Imputation can help with understanding data structure, as well as visualizing and analyzing based on imputed data
    • impute_below(c(5,6,7,NA,9,10)) # imputes the NA to be lower than anything in the data at hand
    • impute_below_if(data, is.numeric) # run only for numeric
    • impute_below_at(data, vars(var1,var2)) # select variables
    • impute_below_all(data) # all variables
  • Can also use bind_shadow() to maintain a history of which data points were initially missing and then imputed
    • aq_imp <- airquality %>% bind_shadow() %>% impute_below_all()
    • ggplot(aq_imp, aes(x = Ozone, fill = Ozone_NA)) + geom_histogram()
    • ggplot(aq_imp, aes(x = Ozone, fill = Ozone_NA)) + geom_histogram() + facet_wrap(~Month)
  • Can add labels for whether any of the data are missing
    • aq_imp <- airquality %>% bind_shadow() %>% add_label_missings() %>% impute_below_all()
    • ggplot(aq_imp, aes(x = Ozone, y = Solar.R, colour = any_missing)) + geom_point()
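
The below-range shift that `impute_below()` (and `geom_miss_point()`) relies on can be sketched in base R — roughly, each missing value is placed 10% of the observed range below the minimum, so imputed points sit visibly apart from real data. naniar additionally jitters these values, which this hypothetical `shift_below()` helper omits:

```r
# Sketch of a below-range imputation: place NAs 10% of the observed
# range below the observed minimum (no jitter, unlike naniar).
shift_below <- function(x, prop_below = 0.1) {
    obs_range <- diff(range(x, na.rm = TRUE))
    x[is.na(x)] <- min(x, na.rm = TRUE) - prop_below * obs_range
    x
}

shift_below(c(5, 6, 7, NA, 9, 10))   # NA becomes 5 - 0.1 * (10 - 5) = 4.5
```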

What makes a good imputation?

  • Imputation should be done with care so that the resulting dataset is still reasonable given the rest of the data and the real world
  • Bad imputations come in many forms
    • Mean imputation - calculate the mean from the non-missing data (often ignores the underlying structure of the data)
    • aq_impute_mean <- airquality %>% bind_shadow(only_miss = TRUE) %>% impute_mean_all() %>% add_label_shadow()
  • Can explore imputations using the boxplot, scatterplot, long shadow format histograms, etc.
    • ggplot(aq_impute_mean, aes(x = Ozone_NA, y = Ozone)) + geom_boxplot()
    • ggplot(aq_impute_mean, aes(x = Ozone, y = Solar.R, colour = any_missing)) + geom_point()
    • aq_imp <- airquality %>% bind_shadow() %>% impute_mean_all()
    • aq_imp_long <- shadow_long(aq_imp, Ozone, Solar.R)
    • ggplot(aq_imp_long, aes(x = value, fill = value_NA)) + geom_histogram() + facet_wrap(~variable)
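
The variance-shrinking effect of mean imputation can be seen directly on a toy vector: every imputed point lands exactly at the mean, so the spread of the "completed" variable is artificially reduced while its mean is unchanged (a minimal base-R illustration, not from the course):

```r
# Mean imputation pulls every imputed point to the centre of the
# distribution, shrinking the spread of the completed variable.
x <- c(2, 4, 6, 8, NA, NA, NA, NA)
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

sd(x, na.rm = TRUE)   # spread of the observed values
sd(x_imputed)         # smaller: the four imputed 5s add no variability
```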

Performing imputations:

  • Can use linear regression as a tool for imputation - can use the package “simputation”
    • df %>% bind_shadow(only_miss = TRUE) %>% add_label_shadow() %>% simputation::impute_lm(y ~ x1 + x2)
    • aq_imp_lm <- airquality %>% bind_shadow() %>% add_label_shadow() %>% simputation::impute_lm(Solar.R ~ Wind + Temp + Month) %>% simputation::impute_lm(Ozone ~ Wind + Temp + Month)
    • ggplot(aq_imp_lm, aes(x = Solar.R, y = Ozone, colour = any_missing)) + geom_point()
  • Can compare multiple attempts at imputation as well
    • aq_imp_small <- airquality %>% bind_shadow() %>% simputation::impute_lm(Ozone ~ Wind + Temp) %>% simputation::impute_lm(Solar.R ~ Wind + Temp) %>% add_label_shadow()
    • aq_imp_large <- airquality %>% bind_shadow() %>% simputation::impute_lm(Ozone ~ Wind + Temp + Month + Day) %>% simputation::impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>% add_label_shadow()
    • bound_models <- bind_rows(small = aq_imp_small, large = aq_imp_large, .id = "imp_model")
    • ggplot(bound_models, aes(x = Ozone, y = Solar.R, colour = any_missing)) + geom_point() + facet_wrap(~imp_model)
    • bound_models_gather <- bound_models %>% select(Ozone, Solar.R, any_missing, imp_model) %>% gather(key = "variable", value = "value", -any_missing, -imp_model)
    • ggplot(bound_models_gather, aes(x = imp_model, y = value)) + geom_boxplot() + facet_wrap(~variable)
    • bound_models_gather %>% filter(any_missing == "Missing") %>% ggplot(aes(x = imp_model, y = value)) + geom_boxplot() + facet_wrap(~variable)

Evaluating imputations and models:

  • Can run a standard linear regression, or a regression only with complete.cases
    • aq_cc <- airquality %>% na.omit() %>% bind_shadow() %>% add_label_shadow()
    • aq_imp_lm <- bind_shadow(airquality) %>% add_label_shadow() %>% simputation::impute_lm(Ozone ~ Temp + Wind + Month + Day) %>% simputation::impute_lm(Solar.R ~ Temp + Wind + Month + Day)
    • bound_models <- bind_rows(cc = aq_cc, imp_lm = aq_imp_lm, .id = “imp_model”)
    • model_summary <- bound_models %>% group_by(imp_model) %>% nest() %>% mutate(mod = map(data, ~lm(Temp ~ Ozone + Solar.R + Wind + Day + Month, data = .)), res = map(mod, residuals), pred = map(mod, predict), tidy = map(mod, broom::tidy))
  • Can then examine the impacts of the various imputation techniques
    • model_summary %>% select(imp_model, tidy) %>% unnest()
    • model_summary %>% select(imp_model, res) %>% unnest() %>% ggplot(aes(x = res, fill = imp_model)) + geom_histogram(position = "dodge")
    • model_summary %>% select(imp_model, pred) %>% unnest() %>% ggplot(aes(x = pred, fill = imp_model)) + geom_histogram(position = "dodge")
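
The same comparison can be run without the nest/map machinery: fit one model on the complete cases and one on an imputed copy, then set the coefficients side by side. Mean imputation (via a hypothetical `impute_means()` helper) stands in for a real imputation model here only to keep the sketch dependency-free; `simputation::impute_lm()` would be the better choice in practice:

```r
# Compare a complete-case model against a model fit on imputed data.
impute_means <- function(df) {
    df[] <- lapply(df, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
    df
}

aq_cc  <- na.omit(airquality)       # complete cases only
aq_imp <- impute_means(airquality)  # NAs replaced by column means

mod_cc  <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq_cc)
mod_imp <- lm(Temp ~ Ozone + Solar.R + Wind, data = aq_imp)

# Coefficients shift depending on how the missing values were handled
round(cbind(complete_case = coef(mod_cc), mean_imputed = coef(mod_imp)), 3)
```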

Example code includes:

# Impute the oceanbuoys data below the range using `impute_below`.
ocean_imp <- naniar::impute_below_all(oceanbuoys)

# Visualise the new missing values
ggplot(ocean_imp, aes(x = wind_ew, y = air_temp_c)) +  
  geom_point()

# Impute and track data with `bind_shadow`, `impute_below_all`, and `add_label_shadow`
ocean_imp_track <- naniar::bind_shadow(oceanbuoys) %>% 
  naniar::impute_below_all() %>%
  naniar::add_label_shadow()

# Look at the imputed values
ocean_imp_track
## # A tibble: 736 x 17
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 9 more variables: year_NA <fct>,
## #   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>,
## #   wind_ns_NA <fct>, any_missing <chr>
ggplot(ocean_imp_track, aes(x=wind_ew, y=air_temp_c, colour=any_missing)) + 
  geom_point()

# Visualise the missingness in wind and air temperature, coloring missing air temp values with air_temp_c_NA
ggplot(ocean_imp_track, aes(x = wind_ew, y = air_temp_c, color = air_temp_c_NA)) + 
    geom_point()

# Visualise humidity and air temp, coloring any missing cases using the variable any_missing
ggplot(ocean_imp_track, aes(x = humidity, y = air_temp_c, color = any_missing)) + 
    geom_point()

# Explore the values of air_temp_c, visualising the amount of missings with `air_temp_c_NA`.
p <- ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) + 
    geom_histogram()

# Explore the missings in humidity using humidity_NA
p2 <- ggplot(ocean_imp_track,  aes(x = humidity, fill = humidity_NA)) + 
    geom_histogram()

# Explore the missings in air_temp_c according to year, using `facet_wrap(~year)`.
p + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Explore the missings in humidity according to year, using `facet_wrap(~year)`.
p2 + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Impute the mean value and track the imputations 
ocean_imp_mean <- naniar::bind_shadow(oceanbuoys) %>% 
  naniar::impute_mean_all() %>% 
  naniar::add_label_shadow()

# Explore the mean values in humidity in the imputed dataset
ggplot(ocean_imp_mean, aes(x = humidity_NA, y = humidity)) + 
    geom_boxplot()

# Explore the values in air temperature in the imputed dataset
ggplot(ocean_imp_mean, aes(x = air_temp_c_NA, y = air_temp_c)) + 
    geom_boxplot()

# Explore imputations in air temperature and humidity, coloring by the variable, any_missing
ggplot(ocean_imp_mean, aes(x = air_temp_c, y = humidity, color = any_missing)) + 
    geom_point()

# Explore imputations in air temperature and humidity, coloring by the variable, any_missing, and faceting by year
ggplot(ocean_imp_mean, aes(x = air_temp_c, y = humidity, color = any_missing)) + 
    geom_point() + 
    facet_wrap(~year)

# Gather the imputed data 
ocean_imp_mean_gather <- naniar::shadow_long(ocean_imp_mean, humidity, air_temp_c)

# Inspect the data
ocean_imp_mean_gather
## # A tibble: 1,472 x 4
##    variable   value       variable_NA   value_NA
##    <chr>      <chr>       <chr>         <chr>   
##  1 air_temp_c 27.14999962 air_temp_c_NA !NA     
##  2 air_temp_c 27.02000046 air_temp_c_NA !NA     
##  3 air_temp_c 27          air_temp_c_NA !NA     
##  4 air_temp_c 26.93000031 air_temp_c_NA !NA     
##  5 air_temp_c 26.84000015 air_temp_c_NA !NA     
##  6 air_temp_c 26.94000053 air_temp_c_NA !NA     
##  7 air_temp_c 27.04000092 air_temp_c_NA !NA     
##  8 air_temp_c 27.11000061 air_temp_c_NA !NA     
##  9 air_temp_c 27.20999908 air_temp_c_NA !NA     
## 10 air_temp_c 27.25       air_temp_c_NA !NA     
## # ... with 1,462 more rows
# Explore the imputations in a histogram 
ggplot(ocean_imp_mean_gather, aes(x = as.numeric(value), fill = value_NA)) + 
    geom_histogram() + 
    facet_wrap(~variable)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Impute humidity and air temperature using wind_ew and wind_ns, and track missing values
ocean_imp_lm_wind <- oceanbuoys %>% 
    naniar::bind_shadow() %>%
    simputation::impute_lm(air_temp_c ~ wind_ew + wind_ns) %>% 
    simputation::impute_lm(humidity ~ wind_ew + wind_ns) %>%
    naniar::add_label_shadow()
    
# Plot the imputed values for air_temp_c and humidity, colored by missingness
ggplot(ocean_imp_lm_wind, aes(x = air_temp_c, y = humidity, color = any_missing)) + 
    geom_point()

# Bind the models together 
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          .id = "imp_model")

# Inspect the values of air_temp and humidity as a scatterplot
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
    geom_point() + 
    facet_wrap(~imp_model)

# Build an imputation model adding year as a predictor
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
    simputation::impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
    simputation::impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
    naniar::add_label_shadow()

# Bind the mean, lm_wind, and lm_wind_year models together
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          lm_wind_year = ocean_imp_lm_wind_year,
                          .id = "imp_model"
                          )

# Explore air_temp and humidity, coloring by any missings, and faceting by imputation model
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) + 
    geom_point() + 
    facet_wrap(~imp_model)

# Gather the data and inspect the distributions of the values
bound_models_gather <- bound_models %>%
    select(air_temp_c, humidity, any_missing, imp_model) %>%
    gather(key = "key", value = "value", -any_missing, -imp_model)

# Inspect the distribution for each variable, for each model
ggplot(bound_models_gather, aes(x = imp_model, y = value, color = imp_model)) +
    geom_boxplot() + 
    facet_wrap(~key, scales = "free_y")

# Inspect the imputed values
bound_models_gather %>%
    filter(any_missing == "Missing") %>%
    ggplot(aes(x = imp_model, y = value, color = imp_model)) +
    geom_boxplot() + 
    facet_wrap(~key, scales = "free_y")

# Create an imputed dataset using linear models
ocean_imp_lm_all <- naniar::bind_shadow(oceanbuoys) %>%
    naniar::add_label_shadow() %>%
    simputation::impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
    simputation::impute_lm(air_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
    simputation::impute_lm(humidity ~ wind_ew + wind_ns + year + latitude + longitude)

# Bind the datasets
bound_models <- bind_rows(imp_lm_wind_year = ocean_imp_lm_wind_year,
                          imp_lm_wind = ocean_imp_lm_wind,
                          imp_lm_all = ocean_imp_lm_all,
                          .id = "imp_model"
                          )
# Look at the models
bound_models
## # A tibble: 2,208 x 18
##    imp_model  year latitude longitude sea_temp_c air_temp_c humidity
##    <chr>     <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>
##  1 imp_lm_w~  1997        0      -110       27.6       27.1     79.6
##  2 imp_lm_w~  1997        0      -110       27.5       27.0     75.8
##  3 imp_lm_w~  1997        0      -110       27.6       27       76.5
##  4 imp_lm_w~  1997        0      -110       27.6       26.9     76.2
##  5 imp_lm_w~  1997        0      -110       27.6       26.8     76.4
##  6 imp_lm_w~  1997        0      -110       27.8       26.9     76.7
##  7 imp_lm_w~  1997        0      -110       28.0       27.0     76.5
##  8 imp_lm_w~  1997        0      -110       28.0       27.1     78.3
##  9 imp_lm_w~  1997        0      -110       28.0       27.2     78.6
## 10 imp_lm_w~  1997        0      -110       28.0       27.2     76.9
## # ... with 2,198 more rows, and 11 more variables: wind_ew <dbl>,
## #   wind_ns <dbl>, year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
## #   wind_ew_NA <fct>, wind_ns_NA <fct>, any_missing <chr>
# Create the model summary for each dataset
model_summary <- bound_models %>% 
    group_by(imp_model) %>%
    nest() %>%
    mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)), 
           res = map(mod, residuals), pred = map(mod, predict), tidy = map(mod, broom::tidy)
           )

# Explore the coefficients in the model
model_summary %>% 
    select(imp_model, tidy) %>%
    unnest()
## # A tibble: 12 x 6
##    imp_model        term          estimate std.error statistic   p.value
##    <chr>            <chr>            <dbl>     <dbl>     <dbl>     <dbl>
##  1 imp_lm_wind_year (Intercept)  -614.      48.5        -12.7  2.49e- 33
##  2 imp_lm_wind_year air_temp_c      0.927    0.0235      39.5  3.25e-183
##  3 imp_lm_wind_year humidity        0.0221   0.00427      5.17 3.05e-  7
##  4 imp_lm_wind_year year            0.308    0.0245      12.6  7.14e- 33
##  5 imp_lm_wind      (Intercept) -1742.      56.1        -31.0  1.83e-135
##  6 imp_lm_wind      air_temp_c      0.365    0.0279      13.1  2.73e- 35
##  7 imp_lm_wind      humidity        0.0225   0.00690      3.26 1.17e-  3
##  8 imp_lm_wind      year            0.880    0.0283      31.1  6.79e-136
##  9 imp_lm_all       (Intercept)  -697.      51.8        -13.5  5.04e- 37
## 10 imp_lm_all       air_temp_c      0.890    0.0255      35.0  2.90e-158
## 11 imp_lm_all       humidity        0.0127   0.00463      2.75 6.03e-  3
## 12 imp_lm_all       year            0.351    0.0262      13.4  1.12e- 36
best_model <- "imp_lm_all"
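The choice of `"imp_lm_all"` is stated without a comparison criterion. One way to back up that choice (a sketch, reusing the `model_summary` object built above) is to pull overall fit statistics for each fitted model with `broom::glance()`:

```r
# Compare overall fit statistics (r.squared, sigma, AIC) across the three
# imputation models; glance() returns one summary row per fitted lm
model_summary %>% 
    mutate(glance = map(mod, broom::glance)) %>%
    select(imp_model, glance) %>%
    unnest()
```

The model with the lowest residual standard error (`sigma`) and AIC on the imputed data would then be a reasonable pick.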

Analyzing Election and Polling Data in R

Chapter 1 - Presidential Job Approval Polls

Introduction:

  • Basic tools for organizing, analyzing, and visualizing polling data in R
    • Data wrangling
    • Prediction of election winners
    • Mapping and regression at the county level
    • Overall ensemble of exercises
  • Presidential approval polls are surveys of the public
    • Approval tends to have a strong relationship with election outcomes
    • Gallup dataset of approval polls since 1946
  • Can select and filter approval data in R
    • library(tidyverse)
    • data.slim <- data %>% select(variable_1, variable_2, …)
    • data.slim %>% filter(variable_1 == "observation")
    • gallup %>% select(President, Date, Approve) %>% filter(President == "Trump")

Averaging Job Approval by President:

  • Can group by variables and then take appropriate summaries
    • data %>% group_by(variable)
    • gallup %>% group_by(President) %>% summarise(MeanApproval = mean(Approve))

Visualizing Trump’s Approval Over Time:

  • Can create averages of available polling and visualize over time
    • Example of the RCP aggregate of approval polls
  • Can convert dates using lubridate
    • library(lubridate)
    • date <- ymd("2018-01-01")
    • month(date) # Equal to 1
    • month(date, label = TRUE) # Equal to "Jan"
  • Creating a moving average is a simple but powerful technique - can use zoo and the rollmean() function, followed by ggplot2
    • TrumpApproval %>% mutate(AvgApprove = rollmean(Approve, 10, na.pad = TRUE, align = "right"))

Example code includes:

approval_polls <- readr::read_csv("./RInputFiles/gallup_approval_polls.csv")
## Parsed with column specification:
## cols(
##   President = col_character(),
##   Date = col_character(),
##   Approve = col_double(),
##   Disapprove = col_double(),
##   Inaug = col_character(),
##   Days = col_double()
## )
glimpse(approval_polls)
## Observations: 4,209
## Variables: 6
## $ President  <chr> "Trump", "Trump", "Trump", "Trump", "Trump", "Trump...
## $ Date       <chr> "12/12/2017", "12/9/2017", "12/6/2017", "12/3/2017"...
## $ Approve    <dbl> 36, 36, 37, 35, 34, 37, 38, 36, 39, 37, 39, 37, 37,...
## $ Disapprove <dbl> 59, 59, 59, 60, 60, 56, 55, 57, 56, 57, 55, 57, 57,...
## $ Inaug      <chr> "1/20/17", "1/20/17", "1/20/17", "1/20/17", "1/20/1...
## $ Days       <dbl> 326, 323, 320, 317, 314, 311, 308, 304, 301, 298, 2...
# Select President, Date, and Approve from approval_polls
approval_polls %>% 
    select(President, Date, Approve) %>%
    head()
## # A tibble: 6 x 3
##   President Date       Approve
##   <chr>     <chr>        <dbl>
## 1 Trump     12/12/2017      36
## 2 Trump     12/9/2017       36
## 3 Trump     12/6/2017       37
## 4 Trump     12/3/2017       35
## 5 Trump     11/30/2017      34
## 6 Trump     11/27/2017      37
# Select the President, Date, and Approve columns and filter to observations where President is equal to "Trump"
approval_polls %>% 
    select(President, Date, Approve) %>%
    filter(President == "Trump")
## # A tibble: 108 x 3
##    President Date       Approve
##    <chr>     <chr>        <dbl>
##  1 Trump     12/12/2017      36
##  2 Trump     12/9/2017       36
##  3 Trump     12/6/2017       37
##  4 Trump     12/3/2017       35
##  5 Trump     11/30/2017      34
##  6 Trump     11/27/2017      37
##  7 Trump     11/24/2017      38
##  8 Trump     11/20/2017      36
##  9 Trump     11/17/2017      39
## 10 Trump     11/14/2017      37
## # ... with 98 more rows
# Group the approval_polls dataset by president and summarise a mean of the Approve column 
approval_polls %>%
    group_by(President) %>%
    summarise(Approve = mean(Approve))
## # A tibble: 14 x 2
##    President  Approve
##    <chr>        <dbl>
##  1 Bush 1        60.1
##  2 Bush 2        51.0
##  3 Carter        46.1
##  4 Clinton       55.6
##  5 Eisenhower    64.2
##  6 Ford          45.9
##  7 Johnson       55.4
##  8 Kennedy       70.2
##  9 Nixon         47.1
## 10 Obama         47.8
## 11 Reagan        52.5
## 12 Roosevelt     72.3
## 13 Truman        42.1
## 14 Trump         38.6
# Extract, or "pull," the Approve column as a vector and save it to the object "TrumpApproval"
TrumpApproval <- approval_polls %>% 
    select(President, Date, Approve) %>%
    filter(President == "Trump") %>%
    pull(Approve)

# Take a mean of the TrumpApproval vector
mean(TrumpApproval)
## [1] 38.62963
# Select the relevant columns from the approval_polls dataset and filter them for the Trump presidency
TrumpPolls <- approval_polls %>% 
    select(President, Date, Approve) %>%
    filter(President == "Trump")
  
# Use the months() and mdy() function to get the month of the day each poll was taken
# Group the dataset by month and summarize a mean of Trump's job approval by month
TrumpPolls %>%
    mutate(Month = months(lubridate::mdy(Date))) %>%
    group_by(Month) %>%
    summarise(Approve = mean(Approve))
## # A tibble: 12 x 2
##    Month     Approve
##    <chr>       <dbl>
##  1 April        40.6
##  2 August       35.7
##  3 December     36  
##  4 February     41.6
##  5 January      44  
##  6 July         38  
##  7 June         37.9
##  8 March        40.5
##  9 May          39.9
## 10 November     37.3
## 11 October      36.7
## 12 September    37.5
# Save Donald Trump's approval polling to a separate object 
TrumpApproval <- approval_polls %>% 
    filter(President == "Trump") %>%
    mutate(Date = lubridate::mdy(Date)) %>%
    arrange(Date) 


# use the rollmean() function from the zoo package to get a moving average of the last 10 polls
TrumpApproval <- TrumpApproval %>%
    mutate(AvgApprove = zoo::rollmean(Approve, 10, na.pad=TRUE, align = "right"))


# Use ggplot to graph Trump's average approval over time
ggplot(data = TrumpApproval, aes(x=Date, y=AvgApprove)) + 
    geom_line()
## Warning: Removed 9 rows containing missing values (geom_path).

# Create a moving average of each president's approval rating
AllApproval <- approval_polls %>%
    group_by(President) %>%
    mutate(AvgApprove = zoo::rollmean(Approve, 10, na.pad=TRUE, align = "right"))


# Graph a moving average of each president's approval rating
ggplot(data = AllApproval, aes(x=Days, y=AvgApprove, col=President)) + 
    geom_line()
## Warning: Removed 126 rows containing missing values (geom_path).


Chapter 2 - US House and Senate Polling

Elections and Polling Parties:

  • Can use polling data to predict elections - the "generic ballot" asks which party is supported in general
    • Data are a mix of publicly available data from pollingreport.com and RCP
    • head(generic_ballot)
    • ggplot(generic_ballot,aes(x=mdy(Date),y=Democrats)) + geom_point()
  • Can explore and wrangle the “generic ballot” dataset

73 Years of “Generic Ballot” Polls:

  • Can analyze the data over time (D-R margin)
    • data %>% group_by(year, month)
    • data %>% group_by(year, month) %>% summarise(support = mean(support))
    • ggplot(data,aes(x=month,y=support)) + geom_point() + geom_smooth(span=0.2)

Calculating and Visualizing Error in Polls:

  • Polls have errors - domain knowledge is important in analyzing the estimates and methodologies revealed in polls
    • Sometimes polls are systematically biased, particularly if all polling companies use similar methodologies
  • Can compare polling data and elections data
    • poll_error <- generic_ballot %>% mutate(Dem.Poll.Margin = Democrats - Republicans, Dem.Vote.Margin = DemVote - RepVote)
    • poll_error <- poll_error %>% group_by(ElecYear) %>% summarise(Dem.Poll.Margin = mean(Dem.Poll.Margin), Dem.Vote.Margin = mean(Dem.Vote.Margin))
    • poll_error %>% mutate(error = Dem.Poll.Margin - Dem.Vote.Margin)
    • rmse <- sqrt(mean(poll_error$error^2))
    • CI <- rmse * 1.96
  • May be helpful to visualize the errors over time
    • ggplot(by_year) + geom_point(aes(x=ElecYear,y=Dem.Poll.Margin,col="Poll")) + geom_point(aes(x=ElecYear,y=Dem.Vote.Margin,col="Vote")) + geom_errorbar(aes(x=ElecYear,ymin=lower, ymax=upper))

Predicting Winners with Linear Regression:

  • Can use linear regression to predict seats in Congress based on polls
    • model <- lm(Dem.Vote.Margin ~ Dem.Poll.Margin, data = by_year)

Example code includes:

generic_ballot <- readr::read_csv("./RInputFiles/generic_ballot.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Democrats = col_double(),
##   Republicans = col_double(),
##   ElecYear = col_double(),
##   ElecDay = col_character(),
##   DaysTilED = col_double(),
##   DemVote = col_double(),
##   RepVote = col_double()
## )
glimpse(generic_ballot)
## Observations: 2,559
## Variables: 8
## $ Date        <chr> "7/4/1945", "7/19/1945", "10/23/1945", "11/28/1945...
## $ Democrats   <dbl> 44, 38, 36, 40, 40, 40, 50, 37, 37, 37, 35, 39, 38...
## $ Republicans <dbl> 31, 31, 51, 34, 34, 33, 39, 37, 32, 34, 34, 35, 35...
## $ ElecYear    <dbl> 1946, 1946, 1946, 1946, 1946, 1946, 1946, 1946, 19...
## $ ElecDay     <chr> "11/5/1946", "11/5/1946", "11/5/1946", "11/5/1946"...
## $ DaysTilED   <dbl> 489, 474, 378, 342, 299, 279, 279, 245, 202, 188, ...
## $ DemVote     <dbl> 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45.0, 45...
## $ RepVote     <dbl> 53.0, 53.0, 53.0, 53.0, 53.0, 53.0, 53.0, 53.0, 53...
# Look at the header and first few rows of the data
head(generic_ballot)
## # A tibble: 6 x 8
##   Date     Democrats Republicans ElecYear ElecDay DaysTilED DemVote RepVote
##   <chr>        <dbl>       <dbl>    <dbl> <chr>       <dbl>   <dbl>   <dbl>
## 1 7/4/1945        44          31     1946 11/5/1~       489      45      53
## 2 7/19/19~        38          31     1946 11/5/1~       474      45      53
## 3 10/23/1~        36          51     1946 11/5/1~       378      45      53
## 4 11/28/1~        40          34     1946 11/5/1~       342      45      53
## 5 1/10/19~        40          34     1946 11/5/1~       299      45      53
## 6 1/30/19~        40          33     1946 11/5/1~       279      45      53
# Filter the election year to 2016 and select the Date, Democrats, and Republicans columns
generic_ballot %>%
    filter(ElecYear == 2016) %>%
    select(Date, Democrats, Republicans)
## # A tibble: 125 x 3
##    Date       Democrats Republicans
##    <chr>          <dbl>       <dbl>
##  1 11/16/2014        41          41
##  2 11/23/2014        39          43
##  3 11/30/2014        40          41
##  4 12/7/2014         40          39
##  5 12/14/2014        37          40
##  6 12/21/2014        39          40
##  7 12/28/2014        39          39
##  8 1/4/2015          40          38
##  9 1/11/2015         38          38
## 10 1/18/2015         39          38
## # ... with 115 more rows
# Mutate a new variable called "Democratic.Margin" equal to the difference between Democrats' vote share and Republicans'
democratic_lead <- generic_ballot %>%
    mutate(Democratic.Margin = Democrats - Republicans)

# Take a look at that new variable!
democratic_lead %>%
    select(Democratic.Margin)
## # A tibble: 2,559 x 1
##    Democratic.Margin
##                <dbl>
##  1                13
##  2                 7
##  3               -15
##  4                 6
##  5                 6
##  6                 7
##  7                11
##  8                 0
##  9                 5
## 10                 3
## # ... with 2,549 more rows
# Group the generic ballot dataset by year and summarise an average of the Democratic.Margin variable
over_time <- democratic_lead %>% 
    group_by(ElecYear) %>%
    summarize(Democratic.Margin = mean(Democratic.Margin))

# Explore the data.frame
head(over_time)
## # A tibble: 6 x 2
##   ElecYear Democratic.Margin
##      <dbl>             <dbl>
## 1     1946             0.524
## 2     1948            -0.333
## 3     1950            11.1  
## 4     1952            -0.975
## 5     1954             9.40 
## 6     1956            11.1
# Create a month and year variable for averaging polls by approximate date
timeseries <- democratic_lead %>%
    mutate(Date = lubridate::mdy(Date), month = lubridate::month(Date), yr = lubridate::year(Date))

# Now group the polls by their month and year, then summarise
timeseries <- timeseries %>%
    group_by(yr, month) %>%
    summarise(Democratic.Margin = mean(Democratic.Margin))


# Mutate a new variable to use a date summary for the monthly average
timeseries_plot <- timeseries %>%
    mutate(time = sprintf("%s-%s-%s", yr, month, "01"))

# Plot the line over time
ggplot(timeseries_plot, aes(x=lubridate::ymd(time), y=Democratic.Margin)) +
    geom_line()

# Make a ggplot with points for monthly polling averages and one trend line running through the entire time series
ggplot(timeseries_plot, aes(x=lubridate::ymd(time), y=Democratic.Margin)) +
    geom_point() + 
    geom_smooth(span=0.2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Mutate two variables for the Democrats' margin in polls and election day votes
poll_error <- generic_ballot %>%
    mutate(Dem.Poll.Margin = Democrats - Republicans,
           Dem.Vote.Margin = DemVote - RepVote
           )

# Average those two variables per year and mutate the "error" variable
poll_error <- poll_error %>%
    group_by(ElecYear) %>%
    summarise(Dem.Poll.Margin = mean(Dem.Poll.Margin), Dem.Vote.Margin = mean(Dem.Vote.Margin)) %>%
    mutate(error = Dem.Poll.Margin - Dem.Vote.Margin)

# Calculate the root-mean-square error of the error variable
rmse <- sqrt(mean(poll_error$error^2))

# Multiply the RMSE by 1.96 to get the 95% confidence interval, or "margin of error"
CI <- rmse * 1.96

# Add variables to our dataset for the upper and lower bound of the `Dem.Poll.Margin` variable
by_year <- poll_error %>%
    mutate(upper = Dem.Poll.Margin + CI, lower = Dem.Poll.Margin - CI)


# Plot estimates for Dem.Poll.Margin and Dem.Vote.Margin on the y axis for each year on the x axis with geom_point
ggplot(by_year) + 
    geom_point(aes(x=ElecYear, y=Dem.Poll.Margin, col="Poll")) +
    geom_point(aes(x=ElecYear, y=Dem.Vote.Margin, col="Vote")) +
    geom_errorbar(aes(x=ElecYear, ymin=lower, ymax=upper))

# Fit a model predicting Democratic vote margin with Democratic poll margin
model <- lm(Dem.Vote.Margin ~ Dem.Poll.Margin, data=by_year)
  
# Evaluate the model
summary(model)
## 
## Call:
## lm(formula = Dem.Vote.Margin ~ Dem.Poll.Margin, data = by_year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9701 -3.2791 -0.1947  3.0657 10.4314 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.80939    1.19857  -0.675    0.504    
## Dem.Poll.Margin  0.52693    0.09582   5.499 3.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.758 on 34 degrees of freedom
## Multiple R-squared:  0.4708, Adjusted R-squared:  0.4552 
## F-statistic: 30.24 on 1 and 34 DF,  p-value: 3.855e-06
# Make a new data.frame that has our prediction variable and value
predictdata <- data.frame("Dem.Poll.Margin" = 5)

# Make the prediction with the coefficients from our model
predict(model, predictdata)
##        1 
## 1.825248

Chapter 3 - Election Results and Political Demography

2016 Presidential Election:

  • County-level results are made available by secretaries of state
    • Can be combined with census data to draw findings - for example, is there a correlation between race and vote share by county?
  • Data are available in the choroplethr package
    • county_merged <- left_join(df_county_demographics, uspres_county, by = "county.fips")
    • ggplot(county_merged, aes(x=percent_white,y=Dem.pct)) + geom_point()
    • ggplot(county_merged, aes(x=percent_white,y=Dem.pct)) + geom_point() + geom_smooth(method="lm")

Making County-Level Maps in R:

  • Mapping can be helpful for identifying trends and areas of interest in the data - a few options include
    • choroplethr
    • geom_sf()
    • leaflet
  • The choroplethr package allows for easy creation of maps with minimal pre-processing
    • library(choroplethr)
    • county_map <- county_merged %>% dplyr::rename("region" = county.fips, "value" = Dem.pct) # names need to be region and value
    • county_choropleth(county_map)

Analyzing Results with Linear Regression:

  • Can further analyze findings using linear regression, both to understand past results and to predict future ones
    • fit <- lm(Dem.pct ~ percent_white, data=county_merged)
    • summary(fit)

2016 Brexit Referendum:

  • Data wrangling, modeling, and visualization of 2016 Brexit vote
  • Can either take a short-term polling average, or fit a LOESS smoother with various spans
    • head(brexit_polls)
    • ggplot(brexit_polls, aes(x = mdy(Date), y = Remain - Leave)) + geom_point() + geom_smooth(method = "loess")

Example code includes:

uspres_results <- readr::read_csv("./RInputFiles/us_pres_2016_by_county.csv")
## Parsed with column specification:
## cols(
##   county.fips = col_double(),
##   county.name = col_character(),
##   state.name = col_character(),
##   party = col_character(),
##   vote.count = col_double(),
##   county.total.count = col_double(),
##   national.party.percent = col_double(),
##   national.count = col_double(),
##   is.national.winner = col_logical()
## )
glimpse(uspres_results)
## Observations: 9,297
## Variables: 9
## $ county.fips            <dbl> 45001, 45001, 45001, 22001, 22001, 2200...
## $ county.name            <chr> "abbeville", "abbeville", "abbeville", ...
## $ state.name             <chr> "south carolina", "south carolina", "so...
## $ party                  <chr> "D", "O", "R", "D", "O", "R", "D", "O",...
## $ vote.count             <dbl> 3741, 271, 6763, 5638, 589, 21162, 6740...
## $ county.total.count     <dbl> 10775, 10775, 10775, 27389, 27389, 2738...
## $ national.party.percent <dbl> 48.098104, 5.789663, 46.112232, 48.0981...
## $ national.count         <dbl> 135851595, 135851595, 135851595, 135851...
## $ is.national.winner     <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE,...
# Deselect the is.national.winner, national.count, and national.party.percent variables
uspres_results.slim <- uspres_results %>%
    select(-c(is.national.winner, national.count, national.party.percent))


# Spread party and votes to their own columns
uspres_county <- uspres_results.slim %>%
    tidyr::spread(key=party,value=vote.count)

# Add a variable to the uspres_county dataset to store the Democratic candidate's percentage of votes
uspres_county <- uspres_county %>%
    mutate(Dem.pct = D/county.total.count)


# Load the county demographic data
data(df_county_demographics, package="choroplethr")

# Look at the demographic data
head(df_county_demographics)
##   region total_population percent_white percent_black percent_asian
## 1   1001            54907            76            18             1
## 2   1003           187114            83             9             1
## 3   1005            27321            46            46             0
## 4   1007            22754            75            22             0
## 5   1009            57623            88             1             0
## 6   1011            10746            22            71             0
##   percent_hispanic per_capita_income median_rent median_age
## 1                2             24571         668       37.5
## 2                4             26766         693       41.5
## 3                5             16829         382       38.3
## 4                2             17427         351       39.4
## 5                8             20730         403       39.6
## 6                6             18628         276       39.6
# Rename the 'region' variable in df_county_demographics to "county.fips"
df_county_demographics <- df_county_demographics %>%
    rename("county.fips" = region)

# Join county demographic with vote share data via its FIPS code
county_merged <- left_join(df_county_demographics, uspres_county, by = "county.fips")
head(county_merged)
##   county.fips total_population percent_white percent_black percent_asian
## 1        1001            54907            76            18             1
## 2        1003           187114            83             9             1
## 3        1005            27321            46            46             0
## 4        1007            22754            75            22             0
## 5        1009            57623            88             1             0
## 6        1011            10746            22            71             0
##   percent_hispanic per_capita_income median_rent median_age county.name
## 1                2             24571         668       37.5     autauga
## 2                4             26766         693       41.5     baldwin
## 3                5             16829         382       38.3     barbour
## 4                2             17427         351       39.4        bibb
## 5                8             20730         403       39.6      blount
## 6                6             18628         276       39.6     bullock
##   state.name county.total.count     D    O     R    Dem.pct
## 1    alabama              24973  5936  865 18172 0.23769671
## 2    alabama              95215 18458 3874 72883 0.19385601
## 3    alabama              10469  4871  144  5454 0.46527844
## 4    alabama               8819  1874  207  6738 0.21249575
## 5    alabama              25588  2156  573 22859 0.08425825
## 6    alabama               4710  3530   40  1140 0.74946921
# Plot percent_white and Dem.pct on the x and y axes; add points and a trend line
ggplot(county_merged, aes(x=percent_white, y=Dem.pct)) +
    geom_point() +
    geom_smooth(method="lm")
## Warning: Removed 44 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).

# Rename the county.fips and Dem.pct variables from our dataset to "region" and "value"
county_map <- county_merged %>%
    rename("region" = county.fips, "value" = Dem.pct)

# Create the map with choroplethrMaps's county_choropleth()
democratic_map <- choroplethr::county_choropleth(county_map)
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 17049, 17065, 2050, 17079, 2105, 17175, 2122, 17101, 17127,
## 2150, 17159, 17165, 2164, 2180, 2188, 2240, 2090, 2198, 15005, 2100, 2170,
## 17151, 2016, 2060, 2290, 17059, 17067, 2282, 17185, 2070, 2110, 2130, 2185,
## 2195, 2220, 2230, 2020, 2068, 2013, 2261, 2270, 2275, 17003, 17047
# Print the map
democratic_map

# Rename variables from our dataset
county_map <- county_merged %>%
    rename("region" = county.fips, "value" = percent_white)

# Create the map with choroplethr's county_choropleth()
white_map <- choroplethr::county_choropleth(county_map)

# Graph the two maps (democratic_map and white_map) from the previous exercises side-by-side
gridExtra::grid.arrange(democratic_map, white_map)

# Fit a linear model to predict Dem.pct dependent on percent_white in each county
fit <- lm(Dem.pct ~ percent_white, data=county_merged)

# Evaluate the model
summary(fit)
## 
## Call:
## lm(formula = Dem.pct ~ percent_white, data = county_merged)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39987 -0.08303 -0.00903  0.07281  0.47761 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.6719046  0.0090408   74.32   <2e-16 ***
## percent_white -0.0045684  0.0001123  -40.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1227 on 3097 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.3482, Adjusted R-squared:  0.348 
## F-statistic:  1655 on 1 and 3097 DF,  p-value: < 2.2e-16
# Fit a linear model to predict Dem.pct dependent on percent_white and per_capita_income in each county
fit <- lm(Dem.pct ~ percent_white + per_capita_income, data=county_merged)

# Evaluate the model
summary(fit)
## 
## Call:
## lm(formula = Dem.pct ~ percent_white + per_capita_income, data = county_merged)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48529 -0.06838  0.00214  0.06847  0.39700 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.006e-01  1.094e-02   45.77   <2e-16 ***
## percent_white     -5.080e-03  1.053e-04  -48.25   <2e-16 ***
## per_capita_income  8.961e-06  3.727e-07   24.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1127 on 3096 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.4508, Adjusted R-squared:  0.4505 
## F-statistic:  1271 on 2 and 3096 DF,  p-value: < 2.2e-16
brexit_polls <- readr::read_csv("./RInputFiles/brexit_polls.csv")
## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Remain = col_double(),
##   Leave = col_double()
## )
glimpse(brexit_polls)
## Observations: 35
## Variables: 3
## $ Date   <chr> "6/23/16", "6/22/16", "6/22/16", "6/22/16", "6/22/16", ...
## $ Remain <dbl> 52, 55, 51, 49, 44, 54, 48, 41, 45, 42, 53, 45, 44, 44,...
## $ Leave  <dbl> 48, 45, 49, 46, 45, 46, 42, 43, 44, 44, 46, 42, 43, 44,...
brexit_results <- readr::read_csv("./RInputFiles/brexit_results.csv")
## Parsed with column specification:
## cols(
##   Seat = col_character(),
##   con_2015 = col_double(),
##   lab_2015 = col_double(),
##   ld_2015 = col_double(),
##   ukip_2015 = col_double(),
##   leave_share = col_double(),
##   born_in_uk = col_double(),
##   male = col_double(),
##   unemployed = col_double(),
##   degree = col_double(),
##   age_18to24 = col_double()
## )
glimpse(brexit_results)
## Observations: 632
## Variables: 11
## $ Seat        <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham an...
## $ con_2015    <dbl> 50.592, 52.050, 52.994, 43.979, 60.788, 22.418, 52...
## $ lab_2015    <dbl> 18.333, 22.369, 26.686, 34.781, 11.197, 41.022, 18...
## $ ld_2015     <dbl> 8.824, 3.367, 8.383, 2.975, 7.192, 14.828, 5.984, ...
## $ ukip_2015   <dbl> 17.867, 19.624, 8.011, 15.887, 14.438, 21.409, 18....
## $ leave_share <dbl> 57.89777, 67.79635, 38.58780, 65.29912, 49.70111, ...
## $ born_in_uk  <dbl> 83.10464, 96.12207, 90.48566, 97.30437, 93.33793, ...
## $ male        <dbl> 49.89896, 48.92951, 48.90621, 49.21657, 48.00189, ...
## $ unemployed  <dbl> 3.637000, 4.553607, 3.039963, 4.261173, 2.468100, ...
## $ degree      <dbl> 13.870661, 9.974114, 28.600135, 9.336294, 18.77559...
## $ age_18to24  <dbl> 9.406093, 7.325850, 6.437453, 7.747801, 5.734730, ...
# Filter the dataset to polls only released after June 16th, 2016, and mutate a variable for the Remain campaign's lead
brexit_average <- brexit_polls %>%
    filter(lubridate::mdy(Date)>lubridate::ymd("2016-06-16") )%>%
    mutate(RemainLead = Remain - Leave)  

# Average the last seven days of polling
mean(brexit_average$RemainLead)
## [1] 2.857143
# Visualize the Remain lead over the entire month before the referendum
ggplot(brexit_polls, aes(x=lubridate::mdy(Date), y=Remain-Leave)) +
    geom_point() + 
    geom_smooth(method='loess')

# Familiarize yourself with the data using the head() function
head(brexit_results)
## # A tibble: 6 x 11
##   Seat  con_2015 lab_2015 ld_2015 ukip_2015 leave_share born_in_uk  male
##   <chr>    <dbl>    <dbl>   <dbl>     <dbl>       <dbl>      <dbl> <dbl>
## 1 Alde~     50.6     18.3    8.82     17.9         57.9       83.1  49.9
## 2 Aldr~     52.0     22.4    3.37     19.6         67.8       96.1  48.9
## 3 Altr~     53.0     26.7    8.38      8.01        38.6       90.5  48.9
## 4 Ambe~     44.0     34.8    2.98     15.9         65.3       97.3  49.2
## 5 Arun~     60.8     11.2    7.19     14.4         49.7       93.3  48.0
## 6 Ashf~     22.4     41.0   14.8      21.4         70.5       97.0  49.2
## # ... with 3 more variables: unemployed <dbl>, degree <dbl>,
## #   age_18to24 <dbl>
# Chart the constituency-by-constituency relationship between voting for the Labour Party and voting to leave the EU
ggplot(brexit_results,aes(x=lab_2015, y=leave_share)) + 
  geom_point()

# Show the relationship between UKIP and Leave vote share with points and a line representing the linear relationship between the variables
ggplot(brexit_results,aes(x=ukip_2015, y=leave_share)) + 
    geom_point() +
    geom_smooth(method = "lm")

# Predict Leave's share with the percentage of a constituency that holds a college degree and its 2015 UKIP vote share
model.multivar <- lm(leave_share ~ ukip_2015 + degree, brexit_results)
summary(model.multivar)
## 
## Call:
## lm(formula = leave_share ~ ukip_2015 + degree, data = brexit_results)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.976  -2.076   0.283   2.479  10.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.52977    1.02503   55.15   <2e-16 ***
## ukip_2015    0.72948    0.04250   17.16   <2e-16 ***
## degree      -0.80594    0.02833  -28.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.718 on 570 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.8821, Adjusted R-squared:  0.8817 
## F-statistic:  2133 on 2 and 570 DF,  p-value: < 2.2e-16

Chapter 4 - Predicting the Future of Politics

US House 2018:

  • Can make predictions from publicly available polling data, and use past data to evaluate general error rates
    • polls_2018 %>% filter(date > "2018-06-01")
    • polls_2018 %>% mutate(Dem.Margin = Dem - Rep)
    • polls_2018 %>% pull(Dem.Margin)
    • mean(polls_2018$Dem.Margin)
  • Can also extend with group_by() function
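The group_by() extension mentioned above can be sketched as follows. This is a hypothetical illustration on a toy tibble; the Pollster column and the numbers are made up for the example and are not part of the course dataset:

```r
library(dplyr)

# Hypothetical sketch: average the Democratic margin by pollster
# (the Pollster column and all values below are illustrative only)
polls <- tibble::tibble(
  Pollster = c("A", "A", "B", "B"),
  Dem      = c(47, 49, 44, 46),
  Rep      = c(40, 41, 42, 40)
)

polls %>%
  mutate(Dem.Margin = Dem - Rep) %>%
  group_by(Pollster) %>%
  summarise(avg_margin = mean(Dem.Margin))
```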

Training a Model to Predict Future with Polls:

  • Can start with the base model
    • lm(Dem.Vote.Margin ~ Dem.Poll.Margin)
  • There may be additional variables, however; perhaps the party in office systematically leads to over/under-performance of polls
    • ggplot(generic_ballot, aes(x=Dem.Poll.Margin, y=Dem.Vote.Margin, col=party_in_power)) + geom_text(aes(label=ElecYear)) + geom_smooth(method='lm')
    • model <- lm(Dem.Vote.Margin ~ Dem.Poll.Margin + party_in_power, data=polls_predict)
    • predict(model, data.frame(Dem.Poll.Margin = 8, party_in_power=-1))
  • May want to extend with the predictive margin of error
    • sqrt(mean(c(model$fitted.values - data$actual_results)^2)) * 1.96
    • sqrt(mean(c(model$fitted.values - polls_predict$Dem.Vote.Margin)^2)) * 1.96
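The ±1.96 × RMSE band described above can be illustrated with a minimal, self-contained sketch; the data here are synthetic (generated with rnorm()), not the course's generic_ballot dataset:

```r
# Synthetic illustration of a point prediction with a +/- 1.96 * RMSE band
set.seed(1)
poll_margin <- rnorm(20, mean = 3, sd = 5)
vote_margin <- 0.9 * poll_margin + rnorm(20, sd = 2)

fit   <- lm(vote_margin ~ poll_margin)
rmse  <- sqrt(mean((fit$fitted.values - vote_margin)^2))
point <- unname(predict(fit, data.frame(poll_margin = 8)))

# Lower bound, point prediction, and upper bound
c(lower = point - 1.96 * rmse, fit = point, upper = point + 1.96 * rmse)
```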

Presidency in 2020:

  • Presidential elections have many moving parts and only a small sample of historical data
    • Popular vote - nationwide tally
    • Electoral vote - tally by states (weighted by size of state)
  • Popular vote can be modeled based on approval rating, economic growth, and length of party incumbency
    • lm(vote_share ~ pres_approve + q2_gdp + two_plus_terms, pres_elecs)
    • ggplot(pres_elecs,aes(x=predict,y=vote_share,label=Year)) + geom_abline() + geom_text()
    • sqrt(mean(c(pres_elecs$predict - pres_elecs$vote_share)^2)) * 1.96

Wrap-up:

  • Approval polls - wrangling and visualizing data
  • Polls and linear regression
  • Mapping election results and running multi-regressions
  • Prediction and applied examples

Example code includes:

polls_2018 <- tibble::tibble(Democrat=c(45, 47, 49, 44, 41, 48, 45, 44, 45, 51, 42, 52, 46, 44, 41, 42, 44, 42, 44, 44, 42, 41, 51, 47, 49, 45, 44, 45, 48, 42, 42, 47, 42, 44, 47, 43, 50, 43, 43, 41, 45, 44, 44, 42, 45, 50, 48, 48, 43, 45, 48, 46, 48, 44, 43, 44, 49, 42, 39, 42, 44, 43, 40, 42, 42, 38, 43, 44, 39, 42, 47, 42, 43, 43, 48, 49, 46, 43, 43, 45, 44, 43, 44, 44, 47, 44, 38, 42, 43, 43, 41, 41, 42, 42, 50, 50, 44, 46, 44, 40, 42, 43, 38, 43, 49, 43, 38, 50, 44, 40, 37, 41, 47, 54, 46, 43, 38, 42, 39, 38, 49, 49, 43, 38, 42, 45, 47, 42, 37, 41, 38, 43, 42, 51, 51, 42, 37, 41, 53, 46, 44, 40, 44, 42, 38, 44, 44, 39, 44, 56, 51, 44, 51, 37, 41, 50, 42, 37, 40, 41, 36, 42, 37, 42, 43, 43, 42, 38, 44, 51, 40, 38, 38, 51, 51, 39, 40, 43, 44, 50, 40, 36, 42, 41, 42, 54, 40, 43, 39, 41, 40, 48, 42, 49, 39, 43, 40, 40, 39, 43, 40, 40, 39, 49, 41, 46, 41, 40, 47, 39, 51, 43, 39, 44, 40, 40, 40, 50, 42, 39, 43, 37, 43, 47, 41, 48, 42, 38, 43, 38, 42, 50, 41, 42, 39, 43, 38, 41, 40, 42, 49, 42, 40, 42, 38, 41, 47, 39, 50, 40, 47, 47, 38, 40, 45, 40, 43, 41, 48, 47, 46, 46, 49, 45, 48), 
                             Republican=c(39, 34, 38, 38, 37, 43, 36, 40, 36, 42, 38, 41, 39, 41, 39, 36, 40, 36, 37, 38, 37, 38, 39, 40, 37, 37, 35, 37, 40, 38, 37, 34, 38, 34, 37, 40, 41, 35, 38, 38, 36, 37, 37, 36, 39, 42, 43, 41, 37, 35, 43, 40, 39, 38, 32, 37, 42, 39, 37, 37, 37, 38, 37, 36, 37, 37, 38, 35, 38, 35, 44, 39, 37, 34, 43, 41, 38, 34, 38, 34, 35, 38, 34, 39, 43, 36, 30, 36, 36, 34, 37, 37, 35, 35, 39, 44, 39, 41, 38, 37, 38, 38, 30, 36, 40, 37, 31, 41, 37, 38, 31, 39, 32, 38, 39, 35, 30, 38, 30, 39, 41, 38, 37, 31, 38, 39, 45, 37, 31, 37, 30, 37, 37, 39, 41, 36, 32, 38, 39, 40, 37, 31, 36, 36, 31, 36, 35, 27, 34, 38, 40, 36, 36, 29, 36, 37, 36, 31, 38, 35, 30, 36, 28, 33, 36, 40, 34, 31, 36, 36, 33, 30, 39, 40, 40, 36, 31, 38, 37, 35, 33, 28, 36, 33, 37, 38, 33, 37, 32, 39, 34, 37, 38, 43, 33, 37, 31, 37, 33, 38, 32, 36, 33, 35, 36, 40, 33, 38, 40, 34, 42, 36, 34, 37, 34, 40, 34, 40, 40, 32, 40, 35, 39, 41, 35, 38, 40, 35, 37, 36, 39, 40, 37, 39, 33, 39, 36, 37, 33, 35, 38, 37, 35, 36, 35, 41, 42, 34, 41, 44, 42, 41, 32, 40, 38, 37, 40, 39, 43, 38, 41, 43, 41, 42, 40), 
                             end_date=as.Date(c('2018-08-28', '2018-08-28', '2018-08-21', '2018-08-21', '2018-08-21', '2018-08-19', '2018-08-18', '2018-08-14', '2018-08-14', '2018-08-13', '2018-08-12', '2018-08-12', '2018-08-12', '2018-08-07', '2018-08-07', '2018-08-06', '2018-07-31', '2018-07-31', '2018-07-30', '2018-07-24', '2018-07-24', '2018-07-23', '2018-07-23', '2018-07-22', '2018-07-22', '2018-07-17', '2018-07-17', '2018-07-14', '2018-07-11', '2018-07-10', '2018-07-10', '2018-07-10', '2018-07-03', '2018-07-02', '2018-07-02', '2018-07-01', '2018-07-01', '2018-06-29', '2018-06-26', '2018-06-26', '2018-06-25', '2018-06-24', '2018-06-19', '2018-06-19', '2018-06-18', '2018-06-17', '2018-06-17', '2018-06-13', '2018-06-12', '2018-06-12', '2018-06-12', '2018-06-10', '2018-06-06', '2018-06-05', '2018-06-05', '2018-06-04', '2018-05-30', '2018-05-29', '2018-05-29', '2018-05-29', '2018-05-22', '2018-05-22', '2018-05-22', '2018-05-19', '2018-05-15', '2018-05-15', '2018-05-14', '2018-05-08', '2018-05-08', '2018-05-07', '2018-05-05', '2018-05-01', '2018-05-01', '2018-05-01', '2018-05-01', '2018-04-30', '2018-04-30', '2018-04-24', '2018-04-24', '2018-04-24', '2018-04-23', '2018-04-17', '2018-04-17', '2018-04-13', '2018-04-11', '2018-04-10', '2018-04-10', '2018-04-07', '2018-04-03', '2018-04-03', '2018-04-01', '2018-03-27', '2018-03-27', '2018-03-27', '2018-03-25', '2018-03-25', '2018-03-21', '2018-03-21', '2018-03-20', '2018-03-20', '2018-03-19', '2018-03-13', '2018-03-13', '2018-03-12', '2018-03-08', '2018-03-06', '2018-03-06', '2018-03-05', '2018-03-05', '2018-02-27', '2018-02-27', '2018-02-26', '2018-02-24', '2018-02-23', '2018-02-21', '2018-02-20', '2018-02-20', '2018-02-13', '2018-02-13', '2018-02-12', '2018-02-11', '2018-02-07', '2018-02-06', '2018-02-06', '2018-02-04', '2018-02-01', '2018-01-30', '2018-01-30', '2018-01-30', '2018-01-23', '2018-01-23', '2018-01-21', '2018-01-20', '2018-01-18', '2018-01-18', '2018-01-16', '2018-01-16', '2018-01-16', '2018-01-15', 
'2018-01-10', '2018-01-09', '2018-01-09', '2018-01-05', '2018-01-02', '2018-01-02', '2017-12-26', '2017-12-19', '2017-12-19', '2017-12-18', '2017-12-17', '2017-12-12', '2017-12-12', '2017-12-12', '2017-12-12', '2017-12-11', '2017-12-07', '2017-12-05', '2017-12-05', '2017-12-03', '2017-11-28', '2017-11-28', '2017-11-25', '2017-11-22', '2017-11-21', '2017-11-19', '2017-11-15', '2017-11-14', '2017-11-14', '2017-11-11', '2017-11-09', '2017-11-07', '2017-11-07', '2017-11-06', '2017-11-05', '2017-11-01', '2017-10-31', '2017-10-31', '2017-10-30', '2017-10-30', '2017-10-24', '2017-10-24', '2017-10-24', '2017-10-23', '2017-10-16', '2017-10-16', '2017-10-15', '2017-10-10', '2017-10-09', '2017-10-03', '2017-10-01', '2017-09-26', '2017-09-25', '2017-09-24', '2017-09-20', '2017-09-19', '2017-09-17', '2017-09-12', '2017-09-11', '2017-09-05', '2017-09-03', '2017-08-29', '2017-08-28', '2017-08-22', '2017-08-21', '2017-08-19', '2017-08-17', '2017-08-15', '2017-08-14', '2017-08-12', '2017-08-08', '2017-08-06', '2017-08-06', '2017-08-01', '2017-07-29', '2017-07-25', '2017-07-24', '2017-07-18', '2017-07-17', '2017-07-15', '2017-07-11', '2017-07-09', '2017-07-04', '2017-06-30', '2017-06-27', '2017-06-27', '2017-06-25', '2017-06-24', '2017-06-20', '2017-06-19', '2017-06-13', '2017-06-12', '2017-06-11', '2017-06-06', '2017-06-02', '2017-05-30', '2017-05-30', '2017-05-23', '2017-05-22', '2017-05-16', '2017-05-14', '2017-05-14', '2017-05-11', '2017-05-09', '2017-05-06', '2017-05-02', '2017-04-30', '2017-04-25', '2017-04-25', '2017-04-25', '2017-04-24', '2017-04-20', '2017-04-18', '2017-04-18', '2017-04-15', '2017-04-12', '2017-04-11', '2017-04-09', '2017-04-01', '2017-03-28', '2017-03-27', '2017-03-12', '2017-02-22', '2017-02-08', '2017-01-31', '2017-01-24'))
                             )

# Average all of the generic ballot polls that have been taken since June
polls_2018 %>% 
    filter(lubridate::month(end_date) > 6, lubridate::year(end_date)==2018) %>% 
    mutate(Dem.Margin = Democrat - Republican) %>%
    pull(Dem.Margin) %>% 
    mean()
## [1] 7.081081
# Filter the dataset to include polls from August and September
# Mutate a variable for the Democratic vote margin in that year
polls_predict <- generic_ballot %>%
    filter(lubridate::month(lubridate::mdy(Date)) %in% c(8, 9), ElecYear >= 1980) %>%
    mutate(Dem.Poll.Margin = Democrats - Republicans,
           Dem.Vote.Margin = DemVote - RepVote
           ) %>%
    group_by(ElecYear) %>%
    summarise(Dem.Poll.Margin = mean(Dem.Poll.Margin),
              Dem.Vote.Margin = mean(Dem.Vote.Margin)
              ) %>%
    arrange(ElecYear) %>%
    mutate(error=Dem.Poll.Margin - Dem.Vote.Margin, 
           party_in_power=c(-1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1)
           )


# Fit a model to predict Democrats' November vote margin with the Democratic poll margin and party in power variable
model <- lm(Dem.Vote.Margin ~ Dem.Poll.Margin + party_in_power, data=polls_predict)

# Evaluate the model
summary(model)
## 
## Call:
## lm(formula = Dem.Vote.Margin ~ Dem.Poll.Margin + party_in_power, 
##     data = polls_predict)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3893 -2.4283 -0.2004  2.4982  4.6166 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.1168     1.1244  -1.883 0.078079 .  
## Dem.Poll.Margin   0.8856     0.2070   4.278 0.000577 ***
## party_in_power   -2.1348     0.8809  -2.423 0.027601 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.238 on 16 degrees of freedom
## Multiple R-squared:  0.7498, Adjusted R-squared:  0.7185 
## F-statistic: 23.98 on 2 and 16 DF,  p-value: 1.535e-05
# Make a prediction for November if Democrats are up 7.5 points in the generic ballot and the party_in_power is the Republicans (-1)
predict(model, data.frame(Dem.Poll.Margin = 7.5, party_in_power=-1))
##        1 
## 6.660162
# Multiply the root-mean-square error by 1.96
sqrt(mean(c(model$fitted.values - polls_predict$Dem.Vote.Margin)^2)) * 1.96
## [1] 5.823251
pres_elecs <- tibble::tibble(Year=c(2016, 2012, 2008, 2004, 2000, 1996, 1992, 1988, 1984, 1980, 1976, 1972, 1968, 1964, 1960, 1956, 1952, 1948), 
                             q2_gdp=c(2.3, 1.3, 1.3, 2.6, 8, 7.1, 4.3, 5.2, 7.1, -7.9, 3, 9.8, 7, 4.7, -1.9, 3.2, 0.4, 7.5), 
                             pres_approve=c(7, -0.8, -37, -0.5, 19.5, 15.5, -18, 10, 20, -21.7, 5, 26, -5, 60.3, 37, 53.5, -27, -6), 
                             two_plus_terms=c(1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1), 
                             vote_share=c(51.1, 51.96, 46.3, 51.2, 50.3, 54.7, 46.5, 53.9, 59.2, 44.7, 48.9, 61.8, 49.6, 61.3, 49.9, 57.8, 44.5, 52.4)
                             )
pres_elecs
## # A tibble: 18 x 5
##     Year q2_gdp pres_approve two_plus_terms vote_share
##    <dbl>  <dbl>        <dbl>          <dbl>      <dbl>
##  1  2016    2.3          7                1       51.1
##  2  2012    1.3         -0.8              0       52.0
##  3  2008    1.3        -37                1       46.3
##  4  2004    2.6         -0.5              0       51.2
##  5  2000    8           19.5              1       50.3
##  6  1996    7.1         15.5              0       54.7
##  7  1992    4.3        -18                1       46.5
##  8  1988    5.2         10                1       53.9
##  9  1984    7.1         20                0       59.2
## 10  1980   -7.9        -21.7              0       44.7
## 11  1976    3            5                1       48.9
## 12  1972    9.8         26                0       61.8
## 13  1968    7           -5                1       49.6
## 14  1964    4.7         60.3              0       61.3
## 15  1960   -1.9         37                1       49.9
## 16  1956    3.2         53.5              0       57.8
## 17  1952    0.4        -27                1       44.5
## 18  1948    7.5         -6                1       52.4
#  Make a plot with points representing a year's presidential approval and vote share and a line running through them to show the linear relationship
ggplot(pres_elecs, aes(x=pres_approve, y=vote_share, label=Year)) + 
    geom_text() + 
    geom_smooth(method='lm')

# Make a model that predicts the vote_share variable with pres_approve, q2_gdp, and two_plus_terms
fit <- lm(vote_share ~ pres_approve + q2_gdp + two_plus_terms, pres_elecs)

# Evaluate the model
summary(fit)
## 
## Call:
## lm(formula = vote_share ~ pres_approve + q2_gdp + two_plus_terms, 
##     data = pres_elecs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5897 -1.1466 -0.1435  1.6203  2.5569 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    51.43054    0.80876  63.592  < 2e-16 ***
## pres_approve    0.10132    0.02048   4.947 0.000215 ***
## q2_gdp          0.56573    0.11697   4.837 0.000264 ***
## two_plus_terms -4.04250    0.99566  -4.060 0.001170 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.894 on 14 degrees of freedom
## Multiple R-squared:  0.8944, Adjusted R-squared:  0.8718 
## F-statistic: 39.53 on 3 and 14 DF,  p-value: 4.379e-07
# Save the predicted vote shares to a variable called predict
pres_elecs$predict <- predict(fit, pres_elecs)

# Graph the predictions and vote shares with a label for each election year
ggplot(pres_elecs,aes(x=predict, y=vote_share, label=Year)) + 
    geom_abline() +
    geom_text()

# Calculate the model's root-mean-square error and multiply it by 1.96
sqrt(mean(c(pres_elecs$predict - pres_elecs$vote_share)^2)) * 1.96
## [1] 3.273301
# Make a prediction for hypothetical data
predict(fit, data.frame(pres_approve=-15, q2_gdp=2, two_plus_terms=0))
##        1 
## 51.04215

Analyzing US Census Data in R

Chapter 1 - Census Data in R with tidycensus

Overview:

  • Acquiring census and ACS data through tidycensus, wrangling with tidyverse, acquiring boundary data with tigris, and visualizing with ggplot2
  • The decennial (every 10 years) census is a full count of the population
    • ACS is a detailed study of 3 million households per year
  • Need to acquire a census API key and provide to the tidycensus function
    • state_pop <- get_decennial(geography = "state", variables = "P001001") # geography is the level of aggregation and variables are the desired variables
    • state_income <- get_acs(geography = "state", variables = "B19013_001") # default dataset is the 5-year 2012-2016 data; returns both an estimate and an extra column "moe" (margin of error)

Basic tidycensus functionality:

  • Can acquire census data by legal entities (county, state) or statistical entities (census tracts)
    • geography = "county"
    • geography = "tract"
    • county_income <- get_acs(geography = "county", variables = "B19013_001")
  • Can request the variable(s) of interest
  • Can further subset the data to only include states or counties of interest
    • texas_income <- get_acs(geography = "county", variables = c(hhincome = "B19013_001"), state = "TX")
  • By default, census data are returned in tidy format, though wide (spread) data may be desirable at times
    • get_acs(geography = "county", variables = c(hhincome = "B19013_001", medage = "B01002_001"), state = "TX", output = "wide")

Searching for data with tidycensus:

  • There are thousands of variables across the ACS and the census, so it can be challenging to find variables of interest
  • Built-in search capability in tidycensus
    • v16 <- load_variables(year = 2016, dataset = "acs5", cache = TRUE) # year is the end year, so 2012-2016; cache=TRUE will store the dataset locally for future browsing
  • The ACS variable structure, as shown using B19001_002E (income < 10k dollars)
    • B means "base", while C means "collapse", P means "profile", and S means "subject"
    • The 19001 is the table ID (household income)
    • The 002 is a specific variable
    • The E means that it is an estimate (tidycensus by default returns both the estimate and the margin of error)
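The anatomy of an ACS variable code can be checked with a few lines of base R; the substr() positions below assume the standard 11-character code layout:

```r
# Decompose an ACS variable code into its parts (base R only)
var <- "B19001_002E"
list(
  type     = substr(var, 1, 1),    # "B"     = base table
  table_id = substr(var, 2, 6),    # "19001" = household income table
  line     = substr(var, 8, 10),   # "002"   = specific variable in the table
  suffix   = substr(var, 11, 11)   # "E"     = estimate; "M" = margin of error
)
```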

Visualizing census data with ggplot2:

  • Can plot census data using ggplot2
    • ne_income <- get_acs(geography = "state", variables = "B19013_001", survey = "acs1", state = c("ME", "NH", "VT", "MA", "RI", "CT", "NY"))
    • ggplot(ne_income, aes(x = estimate, y = NAME)) + geom_point()
  • Can improve plots using sorting and formatting
    • ggplot(ne_income, aes(x = estimate, y = reorder(NAME, estimate))) + geom_point(color = "navy", size = 4) + scale_x_continuous(labels = scales::dollar) + theme_minimal(base_size = 14) + labs(x = "2016 ACS estimate", y = "", title = "Median household income by state")

Example code includes:

# Load the tidycensus package into your R session
library(tidycensus)

# Define your Census API key and set it with census_api_key()
api_key <- "INSERT HERE"
census_api_key(api_key)

# Check your API key
Sys.getenv("CENSUS_API_KEY")


# Obtain and view state populations from the 2010 US Census
state_pop <- get_decennial(geography = "state", variables = "P001001")

head(state_pop)

# Obtain and view state median household income from the 2012-2016 American Community Survey
state_income <- get_acs(geography = "state", variables = "B19013_001")

head(state_income)


# Get an ACS dataset for Census tracts in Texas by setting the state
tx_income <- get_acs(geography = "tract",
                     variables = "B19013_001",
                     state = "TX")

# Inspect the dataset
head(tx_income)

# Get an ACS dataset for Census tracts in Travis County, TX
travis_income <- get_acs(geography = "tract",
                         variables = "B19013_001", 
                         state = "TX",
                         county = "Travis")

# Inspect the dataset
head(travis_income)

# Supply custom variable names
travis_income2 <- get_acs(geography = "tract", 
                          variables = c(hhincome = "B19013_001"), 
                          state = "TX",
                          county = "Travis")

# Inspect the dataset
head(travis_income2)


# Return county data in wide format
or_wide <- get_acs(geography = "county", state = "OR",
                   variables = c(hhincome = "B19013_001", medage = "B01002_001"), 
                   output = "wide"
                   )

# Compare output to the tidy format from previous exercises
head(or_wide)

# Create a scatterplot
plot(or_wide$hhincomeE, or_wide$medageE)


# Load variables from the 2012-2016 ACS
v16 <- load_variables(year = 2016, dataset = "acs5", cache = TRUE)

# Get variables from the ACS Data Profile
v16p <- load_variables(year = 2016, dataset = "acs5/profile", cache = TRUE)

# Set year and dataset to get variables from the 2000 Census SF3
v00 <- load_variables(year = 2000, dataset = "sf3", cache = TRUE)


# Filter for table B19001
filter(v16, str_detect(name, "B19001"))

# Use public transportation to search for related variables
filter(v16p, str_detect(label, fixed("public transportation", ignore_case = TRUE)))


# Access the 1-year ACS  with the survey parameter
ne_income <- get_acs(geography = "state",
                     variables = "B19013_001", 
                     survey = "acs1", 
                     state = c("ME", "NH", "VT", "MA", 
                               "RI", "CT", "NY"))

# Create a dot plot
ggplot(ne_income, aes(x = estimate, y = NAME)) + 
  geom_point()

# Reorder the states in descending order of estimates
ggplot(ne_income, aes(x = estimate, y = reorder(NAME, estimate))) + 
  geom_point()


# Set dot color and size
g_color <- ggplot(ne_income, aes(x = estimate, y = reorder(NAME, estimate))) + 
  geom_point(color = "navy", size = 4)

# Format the x-axis labels
g_scale <- g_color + 
  scale_x_continuous(labels = scales::dollar) + 
  theme_minimal(base_size = 18) 

# Label your x-axis, y-axis, and title your chart
g_label <- g_scale + 
  labs(x = "2016 ACS estimate", y = "", title = "Median household income by state")
  
g_label

Chapter 2 - Wrangling US Census Data

Tables and summary variables in tidycensus:

  • Can grab all of the variables from a table at once using the table= option
    • wa_income <- get_acs(geography = "county", state = "WA", table = "B19001")
  • Can normalize based on denominators
    • race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005", Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012")
    • tx_race <- get_acs(geography = "county", state = "TX", variables = race_vars, summary_var = "B03002_001")
    • tx_race_pct <- tx_race %>% mutate(pct = 100 * (estimate / summary_est)) %>% select(NAME, variable, pct)

Census data wrangling with tidy tools:

  • Can use the tidyverse to interact with data that has been retrieved from ACS and the census
  • The split-apply-combine approach to analysis is commonly used
    • tx_largest <- tx_race %>% group_by(GEOID) %>% filter(estimate == max(estimate)) %>% select(NAME, variable, estimate)
    • tx_largest %>% group_by(variable) %>% tally()
  • Can recode variables for group-wise analysis using case_when()
    • wa_grouped <- wa_income %>% filter(variable != "B19001_001") %>% mutate(incgroup = case_when( variable < "B19001_008" ~ "below35k", variable < "B19001_013" ~ "35kto75k", TRUE ~ "above75k")) %>% group_by(NAME, incgroup) %>% summarize(group_est = sum(estimate))
  • Can use purrr to gather data from multiple years and then combine
    • mi_cities <- map_df(2012:2016, function(x) { get_acs(geography = "place", variables = c(totalpop = "B01003_001"), state = "MI", survey = "acs1", year = x) %>% mutate(year = x) })

Working with margins of error in tidycensus:

  • Margins of error exist in the ACS data since it is an annual sample of ~3 million households; the MOE is the value needed to obtain a confidence interval around the estimate (ACS MOEs are published at the 90% confidence level by default, adjustable with tidycensus's moe_level argument)
    • get_acs(geography = "county", variables = c(median_age = "B01002_001"), state = "OR")
  • With sparse data (small geographies or niche queries), the MOE can be a meaningful percentage of the estimate, even exceeding it at times
    • vt_eldpov <- get_acs(geography = "tract", variables = c(eldpovm = "B17001_016", eldpovf = "B17001_030"), state = "VT")
  • There are multiple tidycensus functions for calculating margin of error
    • moe_sum(): MOE for a derived sum
    • moe_product(): MOE for a derived product
    • moe_ratio(): MOE for a derived ratio
    • moe_prop(): MOE for a derived proportion
    • vt_eldpov2 <- vt_eldpov %>% group_by(GEOID) %>% summarize( estmf = sum(estimate), moemf = moe_sum(moe = moe, estimate = estimate) )
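For the derived-sum case, the MOE is the square root of the sum of the squared component MOEs; this base-R one-liner mirrors what moe_sum() computes in the simple case (no zero estimates):

```r
# Base-R check of the derived-sum MOE formula: sqrt(sum(moe^2))
moes <- c(55, 33, 44, 12, 4)
sqrt(sum(moes^2))  # ~78.8; compare with tidycensus::moe_sum(moe = moes)
```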

Visualizing margins of error from ACS:

  • Can use an error-bar plot to show the range of uncertainty - geom_errorbar() and geom_errorbarh()
    • wyoming_age <- get_acs(geography = "county", variables = c(medianage = "B01002_001"), state = "WY")
    • ggplot(wyoming_age, aes(x = estimate, y = NAME)) + geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + geom_point()
  • Can clean up visualizations - remove redundancies in names and order values
    • wyoming_age2 <- wyoming_age %>% mutate(NAME = str_replace(NAME, " County, Wyoming", ""))
    • ggplot(wyoming_age2, aes(x = estimate, y = reorder(NAME, estimate))) + geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + geom_point(size = 3, color = "darkgreen") + theme_grey(base_size = 14) + labs(title = "Median age, counties in Wyoming", subtitle = "2012-2016 American Community Survey", x = "ACS estimate (bars represent margins of error)", y = "")

Example code includes:

library(tidycensus)

# Download table "B19001"
wa_income <- get_acs(geography = "county", 
                 state = "WA", 
                 table = "B19001")

# Check out the first few rows of wa_income
head(wa_income)


# Assign Census variables vector to race_vars
race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005", 
               Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012"
               )

# Request a summary variable from the ACS
ca_race <- get_acs(geography = "county", 
                   state = "CA",
                   variables = race_vars, 
                   summary_var = "B03002_001")

# Calculate a new percentage column and check the result
ca_race_pct <- ca_race %>%
  mutate(pct = 100 * (estimate / summary_est))

head(ca_race_pct)


# Group the dataset and filter the estimate
ca_largest <- ca_race %>%
  group_by(GEOID) %>%
  filter(estimate == max(estimate)) 

head(ca_largest)

# Group the dataset and get a breakdown of the results
ca_largest %>% 
  group_by(variable) %>%
  tally()


# Use a tidy workflow to wrangle ACS data
wa_grouped <- wa_income %>%
  filter(variable != "B19001_001") %>%
  mutate(incgroup = case_when(
    variable < "B19001_008" ~ "below35k", 
    variable < "B19001_013" ~ "35kto75k", 
    TRUE ~ "above75k"
  )) %>%
  group_by(NAME, incgroup) %>%
  summarize(group_est = sum(estimate))

wa_grouped


# Map through ACS1 estimates to see how they change through the years
mi_cities <- map_df(2012:2016, function(x) {
  get_acs(geography = "place", 
          variables = c(totalpop = "B01003_001"), 
          state = "MI", 
          survey = "acs1", 
          year = x) %>%
    mutate(year = x)
})

mi_cities %>% arrange(NAME, year)


# Get data on elderly poverty by Census tract in Vermont
vt_eldpov <- get_acs(geography = "tract", 
                     variables = c(eldpovm = "B17001_016", 
                                   eldpovf = "B17001_030"), 
                     state = "VT")

vt_eldpov

# Identify rows with greater margins of error than their estimates
moe_check <- filter(vt_eldpov, moe > estimate)

# Check proportion of rows where the margin of error exceeds the estimate
nrow(moe_check) / nrow(vt_eldpov)


# Calculate a margin of error for a sum
moe_sum(moe = c(55, 33, 44, 12, 4))

# Calculate a margin of error for a product
moe_product(est1 = 55, est2 = 33, moe1 = 12, moe2 = 9)

# Calculate a margin of error for a ratio
moe_ratio(num = 1000, denom = 950, moe_num = 200, moe_denom = 177)

# Calculate a margin of error for a proportion
moe_prop(num = 374, denom = 1200, moe_num = 122, moe_denom = 333)


# Group the dataset and calculate a derived margin of error
vt_eldpov2 <- vt_eldpov %>%
  group_by(GEOID) %>%
  summarize(
    estmf = sum(estimate), 
    moemf = moe_sum(moe = moe, estimate = estimate)
  )

# Filter rows where newly-derived margin of error exceeds newly-derived estimate
moe_check2 <- filter(vt_eldpov2, moemf > estmf)

# Check proportion of rows where margin of error exceeds estimate
nrow(moe_check2) / nrow(vt_eldpov2)


# Request median household income data
maine_inc <- get_acs(geography = "county", 
                     variables = c(hhincome = "B19013_001"), 
                     state = "ME") 

# Generate horizontal error bars with dots
ggplot(maine_inc, aes(x = estimate, y = NAME)) + 
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + 
  geom_point()


# Remove unnecessary content from the county's name
maine_inc2 <- maine_inc %>%
  mutate(NAME = str_replace(NAME, " County, Maine", ""))

# Build a margin of error plot incorporating your modifications
ggplot(maine_inc2, aes(x = estimate, y = reorder(NAME, estimate))) + 
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + 
  geom_point(size = 3, color = "darkgreen") + 
  theme_grey(base_size = 14) + 
  labs(title = "Median household income", 
       subtitle = "Counties in Maine", 
       x = "ACS estimate (bars represent margins of error)", 
       y = "") + 
  scale_x_continuous(labels = scales::dollar)

Chapter 3 - US Census Geographic Data in R

Understanding census geography and tigris basics:

  • The TIGER line shape files are made publicly available by the US Census Bureau
  • The tigris package simplifies the process of downloading and mapping with TIGER shapes
    • library(tigris)
    • az_counties <- counties(state = "AZ")
    • plot(az_counties)
    • nh_roads <- primary_secondary_roads(state = "NH")
    • plot(nh_roads)
  • By default, tigris returns objects in Spatial DF format - slots encode characteristics of the data

Customizing tigris options:

  • The tigris package allows for customization of plotting options
    • ri_tiger <- counties("RI")
    • ri_cb <- counties("RI", cb = TRUE)
    • par(mfrow = c(1, 2))
    • plot(ri_tiger, main = "TIGER/Line")
    • plot(ri_cb, main = "Cartographic boundary")
  • Can use the sf package for further plotting of tigris data
    • options(tigris_class = "sf")
    • az_sf <- counties("AZ", cb = TRUE)
    • class(az_sf)
  • Can cache tigris shape files on a local computer
    • options(tigris_use_cache = TRUE)
  • The defaults in tigris are to the most recent year of data, though this can be overridden by argument
    • williamson90 <- tracts(state = "TX", county = "Williamson", cb = TRUE, year = 1990)
    • williamson16 <- tracts(state = "TX", county = "Williamson", cb = TRUE, year = 2016)

Combining and joining census geographic districts:

  • Can combine data obtained from multiple tigris data pulls; for example, studying Kansas City MO/KS
    • missouri <- tracts(“MO”, cb = TRUE)
    • kansas <- tracts(“KS”, cb = TRUE)
    • kansas_missouri <- rbind_tigris(kansas, missouri)
    • plot(kansas_missouri$geometry)
  • Can also create a geography such as New England using iteration
    • new_england <- c(“ME”, “NH”, “VT”, “MA”)
    • ne_tracts <- map(new_england, function(x) { tracts(state = x, cb = TRUE) }) %>% rbind_tigris()
  • Can use the standard sf joining tool provided that the data are available in sf format
    • tx_house <- state_legislative_districts(state = “TX”, house = “lower”, cb = TRUE)
    • tx_joined <- left_join(tx_house, tx_members, by = c(“NAME” = “District”))
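The bullets above rely on left_join() keeping every row of the boundary data. A minimal base-R sketch of those join semantics, using made-up district names purely for illustration:

```r
# Sketch of left-join semantics with toy data frames (base R merge;
# the district names here are hypothetical)
districts <- data.frame(NAME = c("1", "2", "3"), stringsAsFactors = FALSE)
members   <- data.frame(District = c("1", "2"),
                        Party    = c("R", "D"), stringsAsFactors = FALSE)

# Keep every district, attaching member info where the keys match
joined <- merge(districts, members,
                by.x = "NAME", by.y = "District", all.x = TRUE)
joined
#   NAME Party
# 1    1     R
# 2    2     D
# 3    3  <NA>
```

Districts with no matching member row survive the join with NA attributes, which is why left_join() is the safe choice when the boundaries are the primary dataset.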

Plotting data with tigris and ggplot2:

  • Can create maps of legislative districts by political party currently holding the office
    • ggplot(tx_joined) + geom_sf()
    • ggplot(tx_joined, aes(fill = Party)) + geom_sf()
    • ggplot(tx_joined, aes(fill = Party)) + geom_sf() + scale_fill_manual(values = c(“R” = “red”, “D” = “blue”))
    • ggplot(tx_joined, aes(fill = Party)) + geom_sf() + coord_sf(crs = 3083, datum = NA) + scale_fill_manual(values = c(“R” = “red”, “D” = “blue”)) + theme_minimal() + labs(title = “State House Districts in Texas”)

Example code includes:

library(tigris)

# Get a counties dataset for Colorado and plot it
co_counties <- counties(state = "CO")
plot(co_counties)


# Get a Census tracts dataset for Denver County, Colorado and plot it
denver_tracts <- tracts(state = "CO", county = "Denver")
plot(denver_tracts)


# Plot area water features for Lane County, Oregon
lane_water <- area_water(state = "OR", county = "Lane")
plot(lane_water)

# Plot primary & secondary roads for the state of New Hampshire
nh_roads <- primary_secondary_roads(state = "NH")
plot(nh_roads)


# Check the class of the data
class(co_counties)

# Take a look at the information in the data slot
head(co_counties@data)

# Check the coordinate system of the data
co_counties@proj4string


# Get a counties dataset for Michigan
mi_tiger <- counties("MI")

# Get the equivalent cartographic boundary shapefile
mi_cb <- counties("MI", cb = TRUE)

# Overlay the two on a plot to make a comparison
plot(mi_tiger)
plot(mi_cb, add = TRUE, border = "red")


# Get data from tigris as simple features
options(tigris_class = "sf")

# Get counties for Colorado and view the first few rows
colorado_sf <- counties("CO")
head(colorado_sf)

# Plot its geometry column
plot(colorado_sf$geometry)


# DO NOT ADD CACHE FOR NOW
# Set the cache directory
# tigris_cache_dir("Your preferred cache directory path would go here")

# Set the tigris_use_cache option
# options(tigris_use_cache = TRUE)

# Check to see that you've modified the option correctly
# getOption("tigris_use_cache")


# Get a historic Census tract shapefile from 1990 for Williamson County, Texas
williamson90 <- tracts(state = "TX", county = "Williamson", 
                       cb = TRUE, year = 1990)

# Compare with a current dataset for 2016
williamson16 <- tracts(state = "TX", county = "Williamson", 
                       cb = TRUE, year = 2016)

# Plot the geometry to compare the results                       
par(mfrow = c(1, 2))
plot(williamson90$geometry)
plot(williamson16$geometry)


# Get Census tract boundaries for Oregon and Washington
or_tracts <- tracts("OR", cb = TRUE)
wa_tracts <- tracts("WA", cb = TRUE)

# Check the tigris attributes of each object
attr(or_tracts, "tigris")
attr(wa_tracts, "tigris")

# Combine the datasets then plot the result
or_wa_tracts <- rbind_tigris(or_tracts, wa_tracts)
plot(or_wa_tracts$geometry)


# Generate a vector of state codes and assign to new_england
new_england <- c("ME", "NH", "VT", "MA")

# Iterate through the states and request tract data for state
ne_tracts <- map(new_england, function(x) {
  tracts(state = x, cb = TRUE)
}) %>%
  rbind_tigris()

plot(ne_tracts$geometry)


# Get boundaries for Texas and set the house parameter
tx_house <- state_legislative_districts(state = "TX", house = "lower", cb = TRUE)

# Merge data on legislators to their corresponding boundaries
tx_joined <- left_join(tx_house, tx_members, by = c("NAME" = "District"))

head(tx_joined)


# Plot the legislative district boundaries
ggplot(tx_joined) + 
  geom_sf()

# Set fill aesthetic to map areas represented by Republicans and Democrats
ggplot(tx_joined, aes(fill = Party)) + 
  geom_sf()

# Set values so that Republican areas are red and Democratic areas are blue
ggplot(tx_joined, aes(fill = Party)) + 
  geom_sf() + 
  scale_fill_manual(values = c("R" = "red", "D" = "blue"))


# Draw a ggplot without gridlines and with an informative title
ggplot(tx_joined, aes(fill = Party)) + 
  geom_sf() + 
  coord_sf(crs = 3083, datum = NA) + 
  scale_fill_manual(values = c("R" = "red", "D" = "blue")) + 
  theme_minimal(base_size = 16) + 
  labs(title = "State House Districts in Texas")

Chapter 4 - Mapping US Census Data

Simple feature geometry and tidycensus:

  • The tidycensus package can wrap the tigris package for some of its less complex data requests
    • geometry = TRUE is available for the following geographies: state, county, tract, ‘block group’, block, zcta
    • cook_value <- get_acs(geography = “tract”, state = “IL”, county = “Cook”, variables = “B25077_001”, geometry = TRUE)
  • For geographies where geometry = TRUE is not available, can request the boundaries with tigris and join instead
    • idaho_income <- get_acs(geography = “school district (unified)”, variables = “B19013_001”, state = “ID”)
    • idaho_school <- school_districts(state = “ID”, type = “unified”, class = “sf”)
    • id_school_joined <- left_join(idaho_school, idaho_income, by = “GEOID”)
  • Maps may move and rescale AK and HI - set shift_geo=TRUE to implement this
    • state_value <- get_acs(geography = “state”, variables = “B25077_001”, survey = “acs1”, geometry = TRUE, shift_geo = TRUE)

Mapping demographic data with ggplot2:

  • Can use fills to create instructive maps, and can edit for improved clarity
    • ggplot(cook_value, aes(fill = estimate)) + geom_sf()
    • ggplot(cook_value, aes(fill = estimate, color = estimate)) + geom_sf() + scale_fill_viridis_c() + scale_color_viridis_c()
    • ggplot(cook_value, aes(fill = estimate, color = estimate)) + geom_sf() + scale_fill_viridis_c(labels = scales::dollar) + scale_color_viridis_c(guide = FALSE) + theme_minimal() + coord_sf(crs = 26916, datum = NA) + labs(title = “Median home value by Census tract”, subtitle = “Cook County, Illinois”, caption = “Data source: 2012-2016 ACS.\nData acquired with the R tidycensus package.”, fill = “ACS estimate”)

Advanced demographic mapping:

  • There are many visual variables in cartography - position, size, shape, value, hue, orientation, texture
  • The graduated symbol map is a popular plotting method - shapes are differentially sized based on underlying variable values
    • centers <- st_centroid(state_value)
    • ggplot() + geom_sf(data = state_value, fill = “white”) + geom_sf(data = centers, aes(size = estimate), shape = 21, fill = “lightblue”, alpha = 0.7, show.legend = “point”) + scale_size_continuous(range = c(1, 20))
  • The “small multiples” plot is another common plot type - achieved with faceting
    • ggplot(dc_race, aes(fill = percent, color = percent)) + geom_sf() + coord_sf(datum = NA) + facet_wrap(~variable)
  • Interactive visualizations are continually more prevalent, and many of the libraries are wrapped for use in R - leaflet, plotly, htmlwidgets
    • library(mapview)
    • mapview(cook_value, zcol = “estimate”, legend = TRUE)

Cartographic workflows with tigris and tidycensus:

  • Can create maps using a dot-density distribution; dots are scattered within a sub-unit (such as a census tract), in proportion to size (such as population), and optionally colored by a third attribute (such as race)
    • dc_dots <- map(c(“White”, “Black”, “Hispanic”, “Asian”), function(group) { dc_race %>% filter(variable == group) %>% st_sample(., size=.$value / 100) %>% st_sf() %>% mutate(group = group) }) %>% reduce(rbind)
    • dc_dots <- dc_dots %>% group_by(group) %>% summarize() # for faster plotting
    • dc_dots_shuffle <- sample_frac(dc_dots, size = 1) # ensures random dots rather than clumped dots by group
    • plot(dc_dots_shuffle, key.pos = 1)
  • Can be valuable to supplement maps with roadways and bodies of water
    • options(tigris_class = “sf”)
    • dc_roads <- roads(“DC”, “District of Columbia”) %>% filter(RTTYP %in% c(“I”, “S”, “U”))
    • dc_water <- area_water(“DC”, “District of Columbia”)
    • dc_boundary <- counties(“DC”, cb = TRUE)
    • plot(dc_water$geometry, col = “lightblue”)
    • ggplot() + geom_sf(data = dc_boundary, color = NA, fill = “white”) + geom_sf(data = dc_dots, aes(color = group, fill = group), size = 0.1) + geom_sf(data = dc_water, color = “lightblue”, fill = “lightblue”) + geom_sf(data = dc_roads, color = “grey”) + coord_sf(crs = 26918, datum = NA) + scale_color_brewer(palette = “Set1”, guide = FALSE) + scale_fill_brewer(palette = “Set1”) + labs(title = “The racial geography of Washington, DC”, subtitle = “2010 decennial U.S. Census”, fill = “”, caption = “1 dot = approximately 100 people.\nData acquired with the R tidycensus and tigris packages.”)
  • Note that ggplot2 draws layers in the order given, so later layers overplot earlier ones

Next steps for working with demographic data in R:

  • Additional R packages for demographic data include censusapi, ipumsr (IPUMS microdata from the Minnesota Population Center), and cancensus (demographic and census data from the Canadian census)

Example code includes:

library(sf)

# Get dataset with geometry set to TRUE
orange_value <- get_acs(geography = "tract", state = "CA", county = "Orange", 
                        variables = "B25077_001", geometry = TRUE
                        )

# Plot the estimate to view a map of the data
plot(orange_value["estimate"])


# Get an income dataset for Idaho by school district
idaho_income <- get_acs(geography = "school district (unified)", 
                        variables = "B19013_001", 
                        state = "ID")

# Get a school district dataset for Idaho
idaho_school <- school_districts(state = "ID", type = "unified", class = "sf")

# Join the income dataset to the boundaries dataset
id_school_joined <- left_join(idaho_school, idaho_income, by = "GEOID")

plot(id_school_joined["estimate"])


# Get a dataset of median home values from the 1-year ACS
state_value <- get_acs(geography = "state", 
                       variables = "B25077_001", 
                       survey = "acs1", 
                       geometry = TRUE, 
                       shift_geo = TRUE)

# Plot the dataset to view the shifted geometry
plot(state_value["estimate"])


# Create a choropleth map with ggplot
ggplot(marin_value, aes(fill = estimate)) + 
  geom_sf()


# Set continuous viridis palettes for your map
ggplot(marin_value, aes(fill = estimate, color = estimate)) + 
  geom_sf() + 
  scale_fill_viridis_c() +  
  scale_color_viridis_c()


# Set the color guide to FALSE and add a subtitle and caption to your map
ggplot(marin_value, aes(fill = estimate, color = estimate)) + 
  geom_sf() + 
  scale_fill_viridis_c(labels = scales::dollar) +  
  scale_color_viridis_c(guide = FALSE) + 
  theme_minimal() + 
  coord_sf(crs = 26911, datum = NA) + 
  labs(title = "Median owner-occupied housing value by Census tract", 
       subtitle = "Marin County, California", 
       caption = "Data source: 2012-2016 ACS.\nData acquired with the R tidycensus package.", 
       fill = "ACS estimate")


# Generate point centers
centers <- st_centroid(state_value)

# Set size parameter and the size range
ggplot() + 
  geom_sf(data = state_value, fill = "white") + 
  geom_sf(data = centers, aes(size = estimate), shape = 21, 
          fill = "lightblue", alpha = 0.7, show.legend = "point") + 
  scale_size_continuous(range = c(1, 20))


# Check the first few rows of the loaded dataset dc_race
head(dc_race)

# Remove the gridlines and generate faceted maps
ggplot(dc_race, aes(fill = percent, color = percent)) + 
  geom_sf() + 
  coord_sf(datum = NA) + 
  facet_wrap(~variable)


# Map the orange_value dataset interactively
m <- mapview(orange_value)
m@map

# Map your data by the estimate column
m <- mapview(orange_value, zcol = "estimate")
m@map

# Add a legend to your map
m <- mapview(orange_value, zcol = "estimate", legend=TRUE)

m@map


# Generate dots, create a group column, and group by group column
dc_dots <- map(c("White", "Black", "Hispanic", "Asian"), function(group) {
  dc_race %>%
    filter(variable == group) %>%
    st_sample(., size = .$value / 100) %>%
    st_sf() %>%
    mutate(group = group) 
}) %>%
  reduce(rbind) %>%
  group_by(group) %>%
  summarize()


# Filter the DC roads object for major roads only
dc_roads <- roads("DC", "District of Columbia") %>%
  filter(RTTYP %in% c("I", "S", "U"))

# Get an area water dataset for DC
dc_water <- area_water("DC", "District of Columbia")

# Get the boundary of DC
dc_boundary <- counties("DC", cb = TRUE)


# Plot your datasets and give your map an informative caption
ggplot() + 
  geom_sf(data = dc_boundary, color = NA, fill = "white") + 
  geom_sf(data = dc_dots, aes(color = group, fill = group), size = 0.1) + 
  geom_sf(data = dc_water, color = "lightblue", fill = "lightblue") + 
  geom_sf(data = dc_roads, color = "grey") + 
  coord_sf(crs = 26918, datum = NA) + 
  scale_color_brewer(palette = "Set1", guide = FALSE) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "The racial geography of Washington, DC", 
       subtitle = "2010 decennial U.S. Census", 
       fill = "", 
       caption = "1 dot = approximately 100 people.\nData acquired with the R tidycensus and tigris packages.")

Multivariate Probability Distributions in R

Chapter 1 - Reading and Plotting Multivariate Data

Reading multivariate data:

  • Multivariate distributions describe 2+ variables (particularly when correlated) at the same time
  • Generally, multivariate data are in the tidy format - rows are observations, columns are variables/attributes
    • iris_url <- “http://mlg.eng.cam.ac.uk/teaching/3f3/1011/iris.data”
    • iris_raw <- read.table(iris_url, sep = “”, header = FALSE)
    • head(iris_raw, n = 4)
    • colnames(iris_raw) <- c(“Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, “Species”)
    • bwt <- read.csv(“birthweight.csv”, row.names = 1)
  • Can access specific portions of the data and change data types as needed
    • iris_raw$Species <- as.factor(iris_raw$Species)
    • iris_raw$Species <- recode(iris_raw$Species, “1 = ‘setosa’; 2 = ‘versicolor’; 3 = ‘virginica’”)

Mean vector and variance-covariance matrix:

  • Can use summary statistics to explore the dataset
  • The mean vector contains the mean of each dimension, while the variance-covariance matrix holds each dimension’s variance on the diagonal and the pairwise covariances (which describe the orientation of the data) off the diagonal
    • colMeans(iris_raw[, 1:4])
    • by(data = iris_raw[, 1:4], INDICES = iris_raw$Species, FUN = colMeans)
    • aggregate(. ~ Species, iris_raw, mean)
  • The variance-covariance matrix is produced using var() and cor()
    • var(iris_raw[, 1:4])
    • cor(iris_raw[, 1:4])
    • corrplot(cor(iris_raw[, 1:4]), method = “ellipse”)
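The relationship between colMeans(), var(), and cor() can be verified by hand; the matrix below is synthetic, used only for illustration:

```r
# Check the link between colMeans(), var(), and cor() by hand
# (synthetic data, base R only)
set.seed(1)
X <- cbind(a = rnorm(200, mean = 5), b = rnorm(200, mean = -2))

# The mean vector
mu_hat <- colMeans(X)

# The sample covariance from its definition:
# centered cross-products divided by (n - 1)
Xc <- sweep(X, 2, mu_hat)                     # center each column
S_manual <- t(Xc) %*% Xc / (nrow(X) - 1)

all.equal(unname(var(X)), unname(S_manual))   # TRUE
all.equal(cor(X), cov2cor(var(X)))            # cor() rescales var()
```

The second check shows that the correlation matrix is just the covariance matrix with each entry divided by the product of the corresponding standard deviations.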

Plotting multivariate data:

  • Can visualize multivariate data with multiple bivariate plots in several ways
    • pairs(iris_raw[, 1:4]) # pair plot
    • pairs(iris_raw[, 1:4], col = iris_raw$Species)
    • lattice::splom(~iris_raw[, 1:4], col = iris_raw$Species, pch = 16)
    • GGally::ggpairs(data = iris_raw, columns = 1:4, mapping = aes(color = Species))
    • scatterplot3d::scatterplot3d(iris_raw[, c(1, 3, 4)], color = as.numeric(iris_raw$Species))

Example code includes:

# Read in the wine dataset
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")

# Print the first four entries
head(wine, n=4)
##   V1    V2   V3   V4   V5  V6   V7   V8   V9  V10  V11  V12  V13  V14
## 1  1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2  1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3  1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4  1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
# Find the dimensions of the data
dim(wine)
## [1] 178  14
# Check the names of the wine dataset 
names(wine)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11"
## [12] "V12" "V13" "V14"
# Assign new names
names(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols', 'Flavanoids', 'Nonflavanoids','Proanthocyanins', 'Color', 'Hue', 'Dilution', 'Proline')
                      
# Check the new column names
names(wine)
##  [1] "Type"            "Alcohol"         "Malic"          
##  [4] "Ash"             "Alcalinity"      "Magnesium"      
##  [7] "Phenols"         "Flavanoids"      "Nonflavanoids"  
## [10] "Proanthocyanins" "Color"           "Hue"            
## [13] "Dilution"        "Proline"
# Check data type of each variable
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
# Change the Type variable data type
wine$Type <- as.factor(wine$Type)

# Check data type again 
str(wine)
## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
# Calculate the mean of the Alcohol, Malic, Ash, and Alcalinity variables 
colMeans(wine[, 2:5])
##    Alcohol      Malic        Ash Alcalinity 
##  13.000618   2.336348   2.366517  19.494944
# Calculate the mean of the variables by wine type
by(wine[, 2:5], wine$Type, FUN=colMeans)
## wine$Type: 1
##    Alcohol      Malic        Ash Alcalinity 
##  13.744746   2.010678   2.455593  17.037288 
## -------------------------------------------------------- 
## wine$Type: 2
##    Alcohol      Malic        Ash Alcalinity 
##  12.278732   1.932676   2.244789  20.238028 
## -------------------------------------------------------- 
## wine$Type: 3
##    Alcohol      Malic        Ash Alcalinity 
##  13.153750   3.333750   2.437083  21.416667
# Calculate the variance-covariance matrix of the variables Alcohol, Malic, Ash, Alcalinity
var.wine <- var(wine[, 2:5])

# Round the matrix values to two decimal places 
round(var.wine, 2)
##            Alcohol Malic  Ash Alcalinity
## Alcohol       0.66  0.09 0.05      -0.84
## Malic         0.09  1.25 0.05       1.08
## Ash           0.05  0.05 0.08       0.41
## Alcalinity   -0.84  1.08 0.41      11.15
# Calculate the correlation matrix 
cor.wine <- cor(wine[, 2:5])

# Round the matrix to two decimal places 
round(cor.wine, 2)
##            Alcohol Malic  Ash Alcalinity
## Alcohol       1.00  0.09 0.21      -0.31
## Malic         0.09  1.00 0.16       0.29
## Ash           0.21  0.16 1.00       0.44
## Alcalinity   -0.31  0.29 0.44       1.00
# Plot the correlations 
corrplot::corrplot(cor.wine, method = "ellipse")

# Scatter plot matrix with base R 
pairs(wine[, 2:5])

# Scatter plot matrix with lattice  
lattice::splom(~wine[, 2:5])

# Scatter plot matrix colored by groups
lattice::splom( ~ wine[2:5], pch = 16, col=wine$Type)

# Produce a matrix of plots for the first four variables 
wine.gg <- GGally::ggpairs(wine[, 2:5])
wine.gg

# Produce a matrix of plots for the first four variables 
wine.gg <- GGally::ggpairs(wine, columns=2:5)
wine.gg

# Color the points by wine type 
wine.gg <- GGally::ggpairs(data = wine, columns = 2:5, aes(color=Type))
wine.gg

# Plot the three variables 
scatterplot3d::scatterplot3d(wine[, c("Alcohol", "Malic", "Alcalinity")], color = as.numeric(wine$Type))


Chapter 2 - Multivariate Normal Distribution

Multivariate Normal Distribution:

  • Important probability distribution - generalization of the univariate normal, but with a variance-covariance matrix
    • Elliptical, with center at the joint mean
    • Shape of the ellipse depends on the variance-covariance matrix
    • If the covariance is 0 and the variances are equal, then this is a circle
    • If the covariance is high, then this will become like a line
  • Can use mvtnorm::rmvnorm() and the related functions
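A minimal base-R sketch (no mvtnorm required) of how the variance-covariance matrix shapes the samples, using an assumed example sigma and the Cholesky factor:

```r
# Sketch: bivariate normal samples from a given variance-covariance
# matrix via its Cholesky factor (base R only; sigma is an example)
set.seed(42)
mu    <- c(1, 2)
sigma <- matrix(c(1, 0.5, 0.5, 2), 2)   # example covariance matrix

z <- matrix(rnorm(2 * 5000), ncol = 2)  # independent standard normals
x <- sweep(z %*% chol(sigma), 2, mu, "+")

# The sample covariance should be close to sigma; a large covariance
# term stretches the elliptical cloud toward a line
round(var(x), 2)
```

Since cov(zR) = R'R = sigma when R = chol(sigma), this is essentially what rmvnorm() does internally.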

Density of a multivariate normal distribution:

  • Just as dnorm() gives the height of the univariate density curve at a given point, dmvnorm() does so for the multivariate normal
  • Can use densities to estimate the likelihood that a data point is generated from a given distribution
    • dmvnorm(x, mean, sigma)
    • mu1 <- c(1, 2)
    • sigma1 <- matrix(c(1, .5, .5, 2), 2)
    • dmvnorm(x = c(0, 0), mean = mu1, sigma = sigma1)
    • x <- rbind(c(0, 0), c(1, 1), c(0, 1)); x
    • dmvnorm(x = x, mean = mu1, sigma = sigma1)
  • Can create a perspective plot using the persp() function
    • d <- expand.grid(seq(-3, 6, length.out = 50 ), seq(-3, 6, length.out = 50))
    • dens1 <- dmvnorm(as.matrix(d), mean=c(1,2), sigma=matrix(c(1, .5, .5, 2), 2))
    • dens1 <- matrix(dens1, nrow = 50 )
    • persp(dens1, theta = 80, phi = 30, expand = 0.6, shade = 0.2, col = “lightblue”, xlab = “x”, ylab = “y”, zlab = “dens”)

Cumulative distribution and inverse CDF:

  • Can be valuable to calculate the CDF and the inverse CDF for statistical purposes
    • mu1 <- c(1, 2)
    • sigma1 <- matrix(c(1, 0.5, 0.5, 2), 2)
    • pmvnorm(upper = c(2, 4), mean = mu1, sigma = sigma1)
    • pmvnorm(lower = c(1, 2), upper = c(2, 4), mean = mu1, sigma = sigma1)
  • Suppose that you want to find the smallest ellipse that contains 95% of the density
    • sigma1 <- diag(2) # force the contours to be circular
    • qmvnorm(p = 0.95, sigma = sigma1, tail = “both”)
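For the circular (identity-sigma) case, the 95% region can also be derived from the fact that the squared distance from the center of a bivariate standard normal follows a chi-squared distribution with 2 degrees of freedom; a base-R check by simulation:

```r
# Sketch: the circle containing 95% of a bivariate standard normal
# has radius sqrt(qchisq(0.95, 2)), since the squared distance from
# the origin is chi-squared with 2 df
set.seed(7)
r95 <- sqrt(qchisq(0.95, df = 2))   # about 2.448

z <- matrix(rnorm(2 * 10000), ncol = 2)
coverage <- mean(rowSums(z^2) <= r95^2)
round(coverage, 2)                  # close to 0.95
```

Note this radius is a different quantity from the equicoordinate (box) quantile that qmvnorm() returns, which is why the two numbers do not coincide.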

Checking normality of multivariate data:

  • Normality is a convenient assumption for simplifying many common statistical tests
    • qqnorm(iris_raw[, 1])
    • qqline(iris_raw[, 1])
    • uniPlot(iris_raw[, 1:4])
  • Can use several tests for normality of the multivariate distributions
    • mardiaTest(iris_raw[, 1:4])
    • mardiaTest(iris_raw[, 1:4], qqplot = TRUE)
    • hzTest(iris_raw[,1:4])
    • mardiaTest(iris_raw[iris_raw$Species == “setosa”, 1:4])
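The idea behind qqnorm()/qqline() can be sketched directly in base R: sorted sample values are compared against theoretical normal quantiles, and for normal data the two track each other almost perfectly:

```r
# Sketch of the normal QQ-plot idea: sorted sample values vs.
# theoretical normal quantiles (base R only)
set.seed(3)
x <- rnorm(500)

p      <- ppoints(length(x))   # plotting positions in (0, 1)
theo_q <- qnorm(p)             # theoretical quantiles
samp_q <- sort(x)              # sample quantiles

round(cor(theo_q, samp_q), 3)  # close to 1 for normal data

# A heavy-tailed sample correlates less strongly with the line
y <- rt(500, df = 2)
cor(theo_q, sort(y)) < cor(theo_q, samp_q)
```

This quantile correlation is the intuition behind formal normality tests such as Shapiro-Wilk.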

Example code includes:

mu.sim <- c(2, -2)
sigma.sim <- matrix(data=c(9, 5, 5, 4), nrow=2, byrow=FALSE)

mu.sim
## [1]  2 -2
sigma.sim
##      [,1] [,2]
## [1,]    9    5
## [2,]    5    4
# Generate 100 bivariate normal samples
multnorm.sample <- mvtnorm::rmvnorm(100, mean=mu.sim, sigma=sigma.sim)

# View the first 6 samples
head(multnorm.sample)
##            [,1]      [,2]
## [1,] -1.5479933 -3.650487
## [2,]  0.2534189 -2.858655
## [3,]  1.8644954 -2.708265
## [4,]  0.2055723 -2.143091
## [5,]  0.3575927 -4.418678
## [6,]  1.0176200 -3.387277
# Scatterplot of the bivariate samples 
plot(multnorm.sample)

# Calculate density
multnorm.dens <- mvtnorm::dmvnorm(multnorm.sample, mean = mu.sim, sigma = sigma.sim)

# Create scatter plot of density heights 
scatterplot3d::scatterplot3d(cbind(multnorm.sample, multnorm.dens), color="blue", pch="", 
                             type = "h", xlab = "x", ylab = "y", zlab = "density"
                             )

mvals <- expand.grid(seq(-5, 10, length.out = 40), seq(-8, 4, length.out = 40))
str(mvals)
## 'data.frame':    1600 obs. of  2 variables:
##  $ Var1: num  -5 -4.62 -4.23 -3.85 -3.46 ...
##  $ Var2: num  -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : int  40 40
##   ..$ dimnames:List of 2
##   .. ..$ Var1: chr  "Var1=-5.0000000" "Var1=-4.6153846" "Var1=-4.2307692" "Var1=-3.8461538" ...
##   .. ..$ Var2: chr  "Var2=-8.0000000" "Var2=-7.6923077" "Var2=-7.3846154" "Var2=-7.0769231" ...
# Calculate density over the specified grid
mvds <- mvtnorm::dmvnorm(mvals, mean=mu.sim, sigma=sigma.sim)
matrix_mvds <-  matrix(mvds, nrow = 40)

# Create a perspective plot
persp(matrix_mvds, theta = 80, phi = 30, expand = 0.6, shade = 0.2, 
      col = "lightblue", xlab = "x", ylab = "y", zlab = "dens"
      )

# Volume under a bivariate standard normal
mvtnorm::pmvnorm(lower = c(-1, -1), upper = c(1, 1))
## [1] 0.4660649
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
# Volume under specified mean and variance-covariance matrix
mvtnorm::pmvnorm(lower = c(-5, -5), upper = c(5, 5), mean = mu.sim, sigma = sigma.sim)
## [1] 0.7734162
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
# Probability contours for a standard bivariate normal
mvtnorm::qmvnorm(p=0.9, tail = "both", sigma = diag(2))
## $quantile
## [1] 1.948779
## 
## $f.quantile
## [1] -1.537507e-06
## 
## attr(,"message")
## [1] "Normal Completion"
# Probability contours for a bivariate normal 
mvtnorm::qmvnorm(p=0.95, tail = "both", mean=mu.sim, sigma=sigma.sim)
## $quantile
## [1] 7.110635
## 
## $f.quantile
## [1] 5.712626e-06
## 
## attr(,"message")
## [1] "Normal Completion"
# Test sample normality 
qqnorm(multnorm.sample[, 1])
qqline(multnorm.sample[, 1])

# requires RJAGS 4+
# Create qqnorm plot (no longer exported from MVN)
# MVN::uniPlot(wine[, c("Alcohol", "Malic", "Ash", "Alcalinity")], type = "qqplot")
# MVN::mvn(wine[, c("Alcohol", "Malic", "Ash", "Alcalinity")], univariatePlot = "qq")

# requires RJAGS 4+
# mardiaTest qqplot 
# wine.mvntest <- MVN::mardiaTest(wine[, 2:5])  # 'MVN::mardiaTest' is deprecated.\nUse 'mvn' instead.\nSee help(\"Deprecated\")
# wine.mvntest <- MVN::mvn(wine[, 2:5])
# wine.mvntest


# requires RJAGS 4+
# Use mardiaTest
# MVN::mvn(multnorm.sample)

# requires RJAGS 4+
# Use hzTest
# MVN::hzTest(wine[, 2:5])  # 'MVN::hzTest' is deprecated.  Use 'mvn' instead.
# MVN::mvn(wine[, 2:5], mvnTest="hz")

Chapter 3 - Other Multivariate Distributions

Other common multivariate distributions:

  • Not all multivariate distributions are normal - may be skewed or follow a different distribution
    • Normal has mean, sigma
    • t has delta, sigma with DF
    • Skew-normal has xi, Omega
    • Skew-t has xi, Omega with DF
  • For the t-distribution, tails are fatter than they would be for the normal (which behaves like a t-distribution with infinite degrees of freedom)
    • rmvt(n, delta, sigma, df)
    • dmvt(x, delta, sigma, df)
    • qmvt(p, delta, sigma, df)
    • pmvt(upper, lower, delta, sigma, df)
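The fat-tail claim is easy to verify in the univariate case with base R:

```r
# Sketch: t tails are fatter than normal tails, and the t approaches
# the normal as the degrees of freedom grow (univariate, base R)
tail_norm <- pnorm(-3)
tail_t5   <- pt(-3, df = 5)
tail_t500 <- pt(-3, df = 500)

tail_t5 > tail_norm          # TRUE: more mass beyond -3 with df = 5
abs(tail_t500 - tail_norm)   # tiny: large df behaves like the normal
```

The same behavior carries over to the multivariate versions: pmvt() assigns more probability to extreme regions than pmvnorm() with the same sigma.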

Density and cumulative distribution for the multivariate t:

  • Individual stocks are often modeled by a univariate t-distribution
  • Portfolios are often valued using a multivariate t-distribution
    • dmvt(x, delta = rep(0, p), sigma = diag(p), log = TRUE)
    • x <- seq(-3, 6, by = 1); y <- seq(-3, 6, by = 1)
    • d <- expand.grid(x = x, y = y)
    • del1 <- c(1, 2); sig1 <- matrix(c(1, .5, .5, 2), 2)
    • dens <- dmvt(as.matrix(d), delta = del1, sigma = sig1, df = 10, log = FALSE)
    • scatterplot3d(cbind(d, dens), type = “h”, zlab = “density”)
  • Can calculate cumulative probabilities with pmvt()
    • pmvt(lower = -Inf, upper = Inf, delta, sigma, df, …)
    • pmvt(lower = c(-1, -2), upper = c(2, 2), delta = c(1, 2), sigma = diag(2), df = 6)
    • qmvt(p, interval, tail, delta, sigma, df)

Multivariate skewed distributions:

  • Normal and t distributions both model symmetric (non-skewed) data, though real-world data are frequently skewed
  • The skew-normal distribution is parameterized by xi (location), Omega (scale / variance-covariance), and alpha (skewness)
  • The sn library offers skew-normal as a distribution
    • dmsn(x, xi, Omega, alpha)
    • pmsn(x, xi, Omega, alpha)
    • rmsn(n, xi, Omega, alpha)
    • dmst(x, xi, Omega, alpha, nu) # nu is df
    • pmst(x, xi, Omega, alpha, nu) # nu is df
    • rmst(n, xi, Omega, alpha, nu ) # nu is df
  • Example of creating random data
    • xi1 <- c(1, 2, -5)
    • Omega1 <- matrix(c(1, 1, 0, 1, 2, 0, 0, 0, 5), 3, 3)
    • alpha1 <- c(4, 30, -5)
    • skew.sample <- rmsn(n = 2000, xi = xi1, Omega = Omega1, alpha = alpha1)
    • skewt.sample <- rmst(n = 2000, xi = xi1, Omega = Omega1, alpha = alpha1, nu = 4)
  • Can estimate the parameters from the data
    • msn.mle(y = skew.sample, opt.method = “BFGS”)
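The standard (xi = 0, unit-scale) univariate skew-normal density can be written directly in base R, which makes the role of alpha concrete:

```r
# Sketch: the standard univariate skew-normal density,
# f(x) = 2 * dnorm(x) * pnorm(alpha * x)  (base R only)
dsn <- function(x, alpha) 2 * dnorm(x) * pnorm(alpha * x)

# It integrates to 1, and reduces to the normal when alpha = 0
total <- integrate(dsn, -Inf, Inf, alpha = 4)$value
c(total = total, at_zero = dsn(0, alpha = 0), dnorm0 = dnorm(0))
```

Positive alpha skews the density to the right, negative alpha to the left; the multivariate dmsn() in sn generalizes this form with a location vector xi and scale matrix Omega.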

Example code includes:

# Generate the t-samples 
multt.sample <- mvtnorm::rmvt(200, delta=mu.sim, sigma=sigma.sim, df=5)

# Print the first 6 samples
head(multt.sample)
##               [,1]       [,2]
## [1,]  1.5698586720 -2.3958159
## [2,] -7.8935083134 -8.5106908
## [3,] -0.0009421534 -4.4741769
## [4,]  6.8646754087 -0.4493624
## [5,]  3.1696218853 -0.9010101
## [6,]  1.5311387470 -2.6552099
# Requires RJAGS 4+
# Check multivariate normality
# MVN::mvn(multt.sample, univariatePlot="qq", mvnTest="mardia")


# Calculate densities
multt.dens <- mvtnorm::dmvt(multt.sample, delta=mu.sim, sigma=sigma.sim, df=5) 

# Plot 3D heights of densities
scatterplot3d::scatterplot3d(cbind(multt.sample, multt.dens), color = "blue", pch = "", 
                             type = "h", xlab = "x", ylab = "y", zlab = "density"
                             )

# Calculate the volume under the specified t-distribution
mvtnorm::pmvt(lower = c(-5, -5), upper = c(5, 5), delta=mu.sim, sigma=sigma.sim, df=5)
## [1] 0.6627531
## attr(,"error")
## [1] 0.0006930966
## attr(,"msg")
## [1] "Normal Completion"
# Calculate the equal probability contour
mvtnorm::qmvt(p=0.9, tail="both", delta=0, sigma=diag(2), df=5)
## $quantile
## [1] 2.490408
## 
## $f.quantile
## [1] -2.150358e-07
## 
## attr(,"message")
## [1] "Normal Completion"
# Generate the skew-normal samples 
skewnorm.sample <- sn::rmsn(n=100, xi=mu.sim, Omega=sigma.sim, alpha=c(4, -4)) 

# Print first six samples
head(skewnorm.sample)
##          [,1]       [,2]
## [1,] 5.378809 -0.1682481
## [2,] 2.294349 -3.7606670
## [3,] 1.898728 -3.7117472
## [4,] 3.753504 -2.2154888
## [5,] 8.020180  0.6697977
## [6,] 2.902386 -1.8605283
# Generate the skew-t samples 
skewt.sample <- sn::rmst(n = 100, xi = mu.sim, Omega = sigma.sim, alpha = c(4, -4), nu=5)

# Print first six samples
head(skewt.sample)
##            [,1]       [,2]
## [1,]  2.7731981 -1.7126980
## [2,]  2.8562516 -2.2158996
## [3,]  6.8955406  0.8405366
## [4,] -4.0210404 -6.0080851
## [5,] -1.9499787 -4.7405617
## [6,] -0.1967472 -3.3460686
skewnorm.sampleDF <- data.frame(x=skewnorm.sample[, 1], y=skewnorm.sample[, 2])
str(skewnorm.sampleDF)
## 'data.frame':    100 obs. of  2 variables:
##  $ x: num  5.38 2.29 1.9 3.75 8.02 ...
##  $ y: num  -0.168 -3.761 -3.712 -2.215 0.67 ...
# Contour plot for skew-normal sample
ggplot(skewnorm.sampleDF, aes(x=x, y=y)) + 
    geom_point() + 
    geom_density_2d()

# Requires RJAGS 4+
# Normality test for skew-normal sample
# skewnorm.Test <- MVN::mvn(skewnorm.sample, mvnTest="mardia", univariatePlot="qq")

# Requires RJAGS 4+
# Normality test for skew-t sample
# skewt.Test <- MVN::mvn(skewt.sample, mvnTest="mardia", univariatePlot="qq") 


ais.female <- data.frame(Ht=c(195.9, 189.7, 177.8, 185, 184.6, 174, 186.2, 173.8, 171.4, 179.9, 193.4, 188.7, 169.1, 177.9, 177.5, 179.6, 181.3, 179.7, 185.2, 177.3, 179.3, 175.3, 174, 183.3, 184.7, 180.2, 180.2, 176, 156, 179.7, 180.9, 179.5, 178.9, 182.1, 186.3, 176.8, 172.6, 176, 169.9, 183, 178.2, 177.3, 174.1, 173.6, 173.7, 178.7, 183.3, 174.4, 173.3, 168.6, 174, 176, 172.2, 182.7, 180.5, 179.8, 179.6, 171.7, 170, 170, 180.5, 173.3, 173.5, 181, 175, 170.3, 165, 169.8, 174.1, 175, 171.1, 172.7, 175.6, 171.6, 172.3, 171.4, 178, 162, 167.3, 162, 170.8, 163, 166.1, 176, 163.9, 173, 177, 168, 172, 167.9, 177.5, 162.5, 172.5, 166.7, 175, 157.9, 158.9, 156.9, 148.9, 149), 
                         Wt=c(78.9, 74.4, 69.1, 74.9, 64.6, 63.7, 75.2, 62.3, 66.5, 62.9, 96.3, 75.5, 63, 80.5, 71.3, 70.5, 73.2, 68.7, 80.5, 72.9, 74.5, 75.4, 69.5, 66.4, 79.7, 73.6, 78.7, 75, 49.8, 67.2, 66, 74.3, 78.1, 79.5, 78.5, 59.9, 63, 66.3, 60.7, 72.9, 67.9, 67.5, 74.1, 68.2, 68.8, 75.3, 67.4, 70, 74, 51.9, 74.1, 74.3, 77.8, 66.9, 83.8, 82.9, 64.1, 68.85, 64.8, 59, 72.1, 75.6, 71.4, 69.7, 63.9, 55.1, 60, 58, 64.7, 87.5, 78.9, 83.9, 82.8, 74.4, 94.8, 49.2, 61.9, 53.6, 63.7, 52.8, 65.2, 50.9, 57.3, 60, 60.1, 52.5, 59.7, 57.3, 59.6, 71.5, 69.7, 56.1, 61.1, 47.4, 56, 45.8, 47.8, 43.8, 37.8, 45.1)
                         )
str(ais.female)
## 'data.frame':    100 obs. of  2 variables:
##  $ Ht: num  196 190 178 185 185 ...
##  $ Wt: num  78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
# Fit skew-normal parameters
fit.ais <- sn::msn.mle(y = cbind(ais.female$Ht, ais.female$Wt), opt.method = "BFGS")

# Print the skewness parameters
fit.ais$dp$alpha
## [1] -1.292446 -1.000158
# Fit skew-normal parameters
fit.ais <- sn::msn.mle(y = ais.female[, c("Ht", "Wt")])

# Print the skewness parameters
fit.ais$dp$alpha
##        Ht        Wt 
## -1.292450 -1.000187

Chapter 4 - Principal Component Analysis and Multidimensional Scaling

Principal Component Analysis:

  • Creation of uncorrelated (orthogonal) components that can be linearly combined to reconstruct the original dataset
    • PC1 explains the maximum variance
    • PC2 is orthogonal to PC1 and explains the maximum remaining variance
    • PC3 is orthogonal to PC1/PC2 and explains the maximum remaining variance
    • princomp(x, cor = FALSE, scores = TRUE) # cor = TRUE computes the PCA from the correlation matrix rather than the covariance matrix
  • Can run an example using the mtcars dataset
    • mtcars.sub <- mtcars[ , -c(8,9)]
    • cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
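
As a runnable version of the bullets above - a minimal sketch using the built-in mtcars dataset, dropping the binary vs and am columns as in the course:

```r
# PCA of mtcars, excluding the binary columns vs (8) and am (9)
mtcars.sub <- mtcars[, -c(8, 9)]

# cor = TRUE works from the correlation matrix, so variables measured
# on different scales contribute equally
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

# Component standard deviations, in decreasing order
cars.pca$sdev
```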

Choosing the number of components:

  • Several methods for assessing the ideal number of PC to use
    • screeplot(cars.pca, type = "lines") # pick an elbow point as the cutoff
    • summary(cars.pca) # pick the number of components that explain X% of the variance, with X specified prior to analysis
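
A sketch of the cumulative-variance rule, assuming a 90% cutoff chosen before the analysis:

```r
cars.pca <- princomp(mtcars[, -c(8, 9)], cor = TRUE)

# Proportion of variance explained by each component
pvar <- cars.pca$sdev^2 / sum(cars.pca$sdev^2)

# Smallest number of components explaining at least 90% of the variance
which(cumsum(pvar) >= 0.90)[1]
```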

Interpreting PCA attributes:

  • There are many attributes of the principal components object, especially when scores=TRUE is set
    • cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
    • cars.pca$loadings # or loadings(cars.pca)
    • biplot(cars.pca, col = c("gray", "steelblue"), cex = c(0.5, 1.3))
  • PCA scores are the projection of the original dataset on the principal components
    • head(cars.pca$scores)
    • head(cars.pca$scores[, 1:2])
    • biplot(cars.pca, col = c("steelblue", "white"), cex = c(0.8, 0.01))
    • scores <- data.frame(cars.pca$scores)
    • ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores))) + geom_text(size = 4, col = "steelblue")
    • cylinder <- factor(mtcars$cyl)
    • ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores), color = cylinder)) + geom_text(size = 4)
  • Can use functions from the factoextra library
    • fviz_pca_biplot(cars.pca)
    • fviz_pca_ind(cars.pca)
    • fviz_pca_var(cars.pca)

Multi-dimensional scaling:

  • Multi-dimensional scaling (MDS) places objects in a 2D or 3D space such that the pairwise distances are preserved as closely as possible
    • cmdscale(d, k = 2, ...) # default k=2
    • usloc <- cmdscale(UScitiesD)
    • ggplot(data = data.frame(usloc), aes(x = X1, y = X2, label = rownames(usloc))) + geom_text()
  • Can apply MDS using the mtcars dataset
    • cars.dist <- dist(mtcars)
    • cars.mds <- cmdscale(cars.dist, k = 2)
    • cars.mds <- data.frame(cars.mds)
    • ggplot(data = cars.mds, aes(x = X1, y = X2, label = rownames(cars.mds))) + geom_text()
  • Can run in higher dimensions by changing the k= parameter
    • cars.dist <- dist(mtcars)
    • cmds3 <- data.frame(cmdscale(cars.dist, k = 3))
    • scatterplot3d(cmds3, type = "h", pch = 19, lty.hplot = 2)
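
The cmdscale() calls above can be run directly; a minimal sketch with the built-in UScitiesD distance matrix:

```r
# Classical MDS of distances between US cities into k = 2 dimensions
usloc <- cmdscale(UScitiesD, k = 2)

# Each city is now a point in the plane; nearby cities stay nearby
head(usloc)
```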

Wrap-Up:

  • Reading and reformatting data
  • Summary statistics - Mean Vector, Variance-covariance matrix, Correlation matrix
  • Plotting in 2D and 3D
  • Multivariate probability distributions: Normal, T, Skew-normal, Skew-t
  • Dimension reduction: PCA, MDS

Example code includes:

data(state)
str(state.x77)
##  num [1:50, 1:8] 3615 365 2212 2110 21198 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
par(mfrow=c(1, 1))
par(mfcol=c(1, 1))


# Calculate PCs
pca.state <- princomp(state.x77, cor=TRUE, scores=TRUE) 

# Plot the PCA object  
plot(pca.state) 

# Print the summary of the PCs
summary(pca.state) 
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion  0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
##                            Comp.6    Comp.7     Comp.8
## Standard deviation     0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion  0.96779544 0.9858515 1.00000000
# Variance explained by each PC
pca.var <- pca.state$sdev^2  

# Proportion of variance explained by each PC
pca.pvar <- pca.var/sum(pca.var) 


# Proportion of variance explained by each principal component
pca.pvar
##     Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6 
## 0.44986195 0.20398990 0.13899264 0.08843803 0.04808021 0.03843271 
##     Comp.7     Comp.8 
## 0.01805610 0.01414846
# Cumulative variance explained plot
plot(cumsum(pca.pvar), xlab = "Principal component", 
     ylab = "Cumulative Proportion of variance explained", ylim = c(0,1), type = 'b')
grid()

# Add a horizontal line
abline(h=0.95, col="blue")

# Draw screeplot
screeplot(pca.state, type = "l")
grid()

# Create dataframe of scores
scores.state <- data.frame(pca.state$scores)

# Plot of scores labeled by state name
ggplot(data = scores.state, aes(x = Comp.1, y = Comp.2, label = rownames(scores.state))) + 
    geom_text( alpha = 0.8, size = 3) + 
    ggtitle("PCA of states data")

# Create dataframe of scores
scores.state <- data.frame(pca.state$scores)

# Plot of scores colored by region
ggplot(data=scores.state, aes(x=Comp.1, y=Comp.2, label=rownames(scores.state), color=state.region)) + 
    geom_text(alpha = 0.8, size = 3) + 
    ggtitle("PCA of states data colored by region")

# Plot the scores
factoextra::fviz_pca_ind(pca.state)

# Plot the PC loadings
factoextra::fviz_pca_var(pca.state)

# Create a biplot
factoextra::fviz_pca_biplot(pca.state)

# Calculate distance 
state.dist <- dist(state.x77)

# Perform multidimensional scaling 
mds.state <- cmdscale(state.dist) 
mds.state_df <- data.frame(mds.state)

# Plot the representation of the data in two dimensions 
ggplot(data = mds.state_df, aes(x = X1, y = X2, label = rownames(mds.state), color = state.region)) + 
    geom_text(alpha = 0.8, size = 3)

# Calculate distance 
wine.dist <- dist(wine[, -1])

# Perform multidimensional scaling 
mds.wine <- cmdscale(wine.dist, k=3) 
mds.wine_df <- data.frame(mds.wine)

# Plot the representation of the data in three dimensions 
scatterplot3d::scatterplot3d(mds.wine_df, color = wine$Type, pch = 19, type = "h", lty.hplot = 2)


Intermediate Functional Programming with purrr

Chapter 1 - Programming with purrr

Refresher of purrr Basics:

  • The map() function is one of the most basic purrr calls
    • map(.x, .f, ...) # for each element of .x do .f
  • OpenData files available from French city St Malo
    • JSON format; nested list
  • The map() function itself always returns a list
    • res <- map(visit_2015, sum) # returns a list
  • Can override to other preferred outputs, such as map_dbl()
    • res <- map_dbl(visit_2015, sum) # returns a numeric
  • Can also extend to map2(.x, .y, .f, ...) which resolves to .f(.x, .y, ...)
    • res <- map2(visit_2015, visit_2016, sum)
    • res <- map2_dbl(visit_2015, visit_2016, sum)
  • Can use pmap() to run operations on 3+ items, though these need to be passed in as a list
    • l <- list(visit_2014, visit_2015, visit_2016)
    • res <- pmap(l, sum)
    • res <- pmap_dbl(l, sum)
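
A self-contained sketch of the map()/map2()/pmap() family; the visit lists here are made-up stand-ins for the St Malo data:

```r
library(purrr)

# Hypothetical monthly visit counts (stand-ins for the course data)
visit_2015 <- list(jan = c(10, 20), feb = c(30, 40))
visit_2016 <- list(jan = c(15, 25), feb = c(35, 45))

# map() returns a list; map_dbl() returns a numeric vector
map_dbl(visit_2015, sum)

# map2() iterates over two lists in parallel
map2_dbl(visit_2015, visit_2016, ~ sum(.x) + sum(.y))

# pmap() takes any number of inputs, bundled into a list
pmap_dbl(list(visit_2015, visit_2016), ~ sum(..1) + sum(..2))
```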

Introduction to mappers:

  • The .f is the action element - function applied to every element, number n to extract the nth element, character vector of named elements to extract
  • The functions can either be regular functions or lambda (anonymous) functions
    • map_dbl(visit_2014, function(x) { round(mean(x)) })
  • The anonymous function with a one-sided formula can be written in any of several ways
    • map_dbl(visits2017, ~ round(mean(.x))) # typically the default
    • map_dbl(visits2017, ~ round(mean(.)))
    • map_dbl(visits2017, ~ round(mean(..1)))
    • map2(visits2016, visits2017, ~ .x + .y)
    • map2(visits2016, visits2017, ~ ..1 + ..2)
  • Can extend to data with more than 2 parameters
    • pmap(list, ~ ..1 + ..2 + ..3)
  • Can use as_mapper to create mapper objects from lambda functions
    • round_mean <- function(x){ round(mean(x)) }
    • round_mean <- as_mapper(~ round(mean(.x)))
  • Mappers have several benefits
    • More concise
    • Easier to read than functions
    • Reusable
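
A short sketch of as_mapper() in action, showing the function and mapper forms side by side:

```r
library(purrr)

# A regular function and its mapper equivalent
round_mean_fn <- function(x) { round(mean(x)) }
round_mean    <- as_mapper(~ round(mean(.x)))

# A mapper is itself callable ...
round_mean(c(1.2, 2.7, 3.9))

# ... and reusable across map() calls
map_dbl(list(a = c(1, 2), b = c(10, 21)), round_mean)
```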

Using Mappers to Clean Data:

  • Can use set_names from purrr to set the names of a list
    • visits2016 <- set_names(visits2016, month.abb)
    • all_visits <- list(visits2015, visits2016, visits2017)
    • named_all_visits <- map(all_visits, ~ set_names(.x, month.abb))
  • The keep() function extracts elements that satisfy a condition
    • over_30000 <- keep(visits2016, ~ sum(.x) > 30000)
    • limit <- as_mapper(~ sum(.x) > 30000)
    • over_mapper <- keep(visits2016, limit)
  • The discard() function removes elements that satisfy a condition
    • under_30000 <- discard(visits2016, ~ sum(.x) > 30000)
    • limit <- as_mapper(~ sum(.x) > 30000)
    • under_mapper <- discard(visits2016, limit)
    • names(under_mapper)
  • Can use keep() and discard() with map() to clean up lists
    • df_list <- list(iris, airquality) %>% map(head)
    • map(df_list, ~ keep(.x, is.factor))
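
keep() and discard() are complements of each other; a minimal sketch on made-up list data:

```r
library(purrr)

x <- list(a = 1:10, b = 1:3, c = 5:20)
limit <- as_mapper(~ sum(.x) > 20)

# keep() retains elements where the predicate is TRUE ...
names(keep(x, limit))

# ... and discard() drops them
names(discard(x, limit))
```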

Predicates:

  • Predicates return either TRUE or FALSE - example of is.numeric()
  • Predicate functionals take an element and a predicate, and then use the predicate on the element
    • keep(airquality, is.numeric) # keep all elements that return TRUE when run against the predicate
  • There are also extensions of every() and some()
    • every(visits2016, is.numeric)
    • every(visits2016, ~ mean(.x) > 1000)
    • some(visits2016, ~ mean(.x) > 1000)
  • The detect_index() function returns the index of the first element that satisfies a condition (or the last, with .right = TRUE)
    • detect_index(visits2016, ~ mean(.x) > 1000) # index of first element that satisfies
    • detect_index(visits2016, ~ mean(.x) > 1000, .right = TRUE) # index of last element that satisfies
  • The detect() returns the value rather than the index
    • detect(visits2016, ~ mean(.x) > 1000, .right = TRUE)
  • The has_element() tests whether an object contains an item
    • visits2016_mean <- map(visits2016, mean)
    • has_element(visits2016_mean, 981)
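
every(), some(), detect_index(), and detect() sketched on made-up data:

```r
library(purrr)

x <- list(a = c(900, 1100), b = c(1200, 1300), c = c(100, 200))

every(x, is.numeric)                # do ALL elements pass?
some(x, ~ mean(.x) > 1000)          # does at least one element pass?
detect_index(x, ~ mean(.x) > 1000)  # index of the first that passes
detect(x, ~ mean(.x) > 1000)        # value of the first that passes
```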

Example code includes:

visit_a <- c(117, 147, 131, 73, 81, 134, 121)
visit_b <- c(180, 193, 116, 166, 131, 153, 146)
visit_c <- c(57, 110, 68, 72, 87, 141, 67)

# Create the to_day function
to_day <- function(x) {
    x*24
}

# Create a list containing both vectors: all_visits
all_visits <- list(visit_a, visit_b)

# Convert to daily number of visits: all_visits_day
all_visits_day <- map(all_visits, to_day)

# Map the mean() function and output a numeric vector 
map_dbl(all_visits_day, mean)
## [1] 2756.571 3720.000
# You'll test out both map() and walk() for plotting
# Both trigger the side effects - the changes in the environment (drawing plots, downloading a file, changing the working directory, ...) - but walk() won't print anything to the console.

# Create all_tests list  and modify with to_day() function
all_tests <- list(visit_a, visit_b, visit_c)
all_tests_day <- map(all_tests, to_day)

# Plot all_tests_day with map
map(all_tests_day, barplot)

## [[1]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
## 
## [[2]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
## 
## [[3]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
# Plot all_tests_day
walk(all_tests_day, barplot)

# Sum across the three vectors element-wise and check the class of sum_all
sum_all <- pmap(all_tests_day, sum)
class(sum_all)
## [1] "list"
# Turn visit_a into daily number using an anonymous function
map(visit_a, function(x) { x*24 })
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
# Turn visit_a into daily number of visits by using a mapper
map(visit_a, ~.x*24)
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
# Create a mapper object called to_day
to_day <- as_mapper(~.x*24)

# Use it on the three vectors
map(visit_a, to_day)
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
map(visit_b, to_day)
## [[1]]
## [1] 4320
## 
## [[2]]
## [1] 4632
## 
## [[3]]
## [1] 2784
## 
## [[4]]
## [1] 3984
## 
## [[5]]
## [1] 3144
## 
## [[6]]
## [1] 3672
## 
## [[7]]
## [1] 3504
map(visit_c, to_day)
## [[1]]
## [1] 1368
## 
## [[2]]
## [1] 2640
## 
## [[3]]
## [1] 1632
## 
## [[4]]
## [1] 1728
## 
## [[5]]
## [1] 2088
## 
## [[6]]
## [1] 3384
## 
## [[7]]
## [1] 1608
# Round visit_a to the nearest ten with a mapper
map_dbl(visit_a, ~ round(.x, -1))
## [1] 120 150 130  70  80 130 120
# Create to_ten, a mapper that rounds to the nearest ten
to_ten <- as_mapper(~ round(.x, -1))

# Map to_ten on visit_b
map_dbl(visit_b, to_ten)
## [1] 180 190 120 170 130 150 150
# Map to_ten on visit_c
map_dbl(visit_c, to_ten)
## [1]  60 110  70  70  90 140  70
# Create a mapper that tests whether .x is more than 100
is_more_than_hundred <- as_mapper(~ .x > 100)

# Run this mapper on the all_visits object
map(all_visits, ~ keep(.x, is_more_than_hundred) )
## [[1]]
## [1] 117 147 131 134 121
## 
## [[2]]
## [1] 180 193 116 166 131 153 146
# Use the day vector to set names on all_visits
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
full_visits_named <- map(all_visits, ~ set_names(.x, day))

# Use this mapper with keep() 
map(full_visits_named, ~ keep(.x, is_more_than_hundred))
## [[1]]
## mon tue wed sat sun 
## 117 147 131 134 121 
## 
## [[2]]
## mon tue wed thu fri sat sun 
## 180 193 116 166 131 153 146
# Set the name of each subvector
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
all_visits_named <- map(all_visits, ~ set_names(.x, day))

# Create a mapper that will test if .x is over 100 
threshold <- as_mapper(~.x > 100)

# Run this mapper on the all_visits object: group_over
group_over <- map(all_visits, ~ keep(.x, threshold) )

# Run this mapper on the all_visits object: group_under
group_under <-  map(all_visits, ~ discard(.x, threshold) )


# Create a threshold variable, set it to 160
threshold <- 160

# Create a mapper that tests if .x is over the defined threshold
over_threshold <- as_mapper(~ .x > threshold)

# Are all elements in every all_visits vectors over the defined threshold? 
map(all_visits, ~ every(.x, over_threshold))
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] FALSE
# Are some elements in every all_visits vectors over the defined threshold? 
map(all_visits, ~ some(.x, over_threshold))
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] TRUE

Chapter 2 - Functional Programming from Theory to Practice

Functional Programming in R:

  • Everything that exists is an object and everything that happens is a function call
    • This means that a function is an object and can be treated as such
    • Every action in R is performed by a function
    • Functions are first-class citizens, and behave like any other object
    • Functions can be manipulated, stored as variables, lambda (anonymous), stored in a list, arguments of a function, returned by a function
    • R is a functional programming language
  • In a “pure function”, output depends only on input, and there are no side-effects (no changes to the environment)
    • Sys.Date() depends on the environment and is thus not pure
    • write.csv() is called solely for the side effect (writing a file) and is thus not pure

Tools for Functional Programming in purrr:

  • A higher-order function can take functions as input and return functions as output
    • nop_na <- function(fun){
    • function(...){ fun(..., na.rm = TRUE) }
    • }
    • sd_no_na <- nop_na(sd)
    • sd_no_na( c(NA, 1, 2, NA) )
  • There are three types of higher-order functions
    • Functionals take another function and return a vector - like map()
    • Function factories take a vector and create a function
    • Function operators take functions and return functions - considered to be “adverbs”
  • Two of the most common adverbs in purrr are safely() and possibly()
    • The safely() call returns a function that will return $result and $error when run; helpful for diagnosing issues with code rather than losing the information
    • safe_log <- safely(log)
    • safe_log("a") # there will be $result of NULL and $error with the error code
    • map( list(2, "a"), safely(log) )
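
safely() in a nutshell - every call returns a two-element list, so errors never stop a map():

```r
library(purrr)

safe_log <- safely(log)

# Success: $result holds the value, $error is NULL
safe_log(10)$result

# Failure: $result is NULL, $error holds the condition object
safe_log("a")$error$message
```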

Using possibly():

  • The possibly() function is an adverb that returns either the value of the function OR the value specified in the otherwise element
    • possible_sum <- possibly(sum, otherwise = "nop")
    • possible_sum("a") # result will be "nop"
  • Note that possibly() does not capture the error itself; on failure it simply returns the pre-specified otherwise value
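
The contrast with safely() in two lines:

```r
library(purrr)

possible_sum <- possibly(sum, otherwise = "nop")

possible_sum(1:3)   # works: returns the sum
possible_sum("a")   # fails: returns the otherwise value, no error object
```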

Handling adverb results:

  • Can use transpose() to turn the output inside out
    • transpose() turns a list of n elements, each containing a and b, into a list of a and b, each containing n elements
  • The compact() function will remove the NULL elements
    • l <- list(1, 2, 3, "a")
    • possible_log <- possibly(log, otherwise = NULL)
    • map(l, possible_log) %>% compact()
  • Can use the httr package specifically for http requests
    • httr::GET(url) will return the value from attempting to connect to url - 200 is good, 404 is unavailable, etc.
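
transpose() and compact() working together, without any network access:

```r
library(purrr)

l <- list(1, 2, "a")
res <- map(l, safely(log))

# Inside out: a list of (result, error) pairs becomes
# one $result list plus one $error list
tr <- transpose(res)

# compact() drops the NULL entries, leaving only the successes
compact(tr$result)
```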

Example code includes:

# `$` is a special type of function called an 'infix operator': infix operators are placed between two elements and can be used without parentheses.

# Launch Sys.time(), Sys.sleep(1), & Sys.time()
Sys.time()
## [1] "2019-03-05 08:26:47 CST"
Sys.sleep(1)
Sys.time()
## [1] "2019-03-05 08:26:48 CST"
data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Launch nrow(iris), Sys.sleep(1), & nrow(iris)
nrow(iris)
## [1] 150
Sys.sleep(1)
nrow(iris)
## [1] 150
# Launch ls(), create an object, then rerun the ls() function
# ls()
# this <- 12
# ls()

# Create a plot of the iris dataset
plot(iris)

urls <- c('https://thinkr.fr', 'https://colinfay.me', 'http://not_working.org', 'https://datacamp.com', 'http://cran.r-project.org/', 'https://not_working_either.org')


# Create a safe version of read_lines()
safe_read <- safely(readr::read_lines)

# Map it on the urls vector
res <- map(urls, safe_read)

# Set the name of the results to `urls`
named_res <-  set_names(res, urls)

# Extract only the "error" part of each sublist
map(named_res, "error")
## $`https://thinkr.fr`
## NULL
## 
## $`https://colinfay.me`
## NULL
## 
## $`http://not_working.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working.org>
## 
## $`https://datacamp.com`
## NULL
## 
## $`http://cran.r-project.org/`
## NULL
## 
## $`https://not_working_either.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working_either.org>
# Code a function that discard() the NULL from safe_read()
safe_read_discard <- function(url){
    safe_read(url) %>%
        discard(is.null)
}

# Map this function on the url list
res <- map(urls, safe_read_discard)


# Create a possibly() version of read_lines()
possible_read <- possibly(readr::read_lines, otherwise = 404)

# Map this function on urls, pipe it into set_names()
res <- map(urls, possible_read) %>% set_names(urls)

# Paste each element of the list 
res_pasted <- map(res, paste, collapse=" ")

# Keep only the elements which are equal to 404
keep(res_pasted, ~ .x == 404)
## $`http://not_working.org`
## [1] "404"
## 
## $`https://not_working_either.org`
## [1] "404"
url_tester <- function(url_list){
    url_list %>%
        # Map a version of read_lines() that otherwise returns 404
        map( possibly(readr::read_lines, otherwise = 404) ) %>%
        # Set the names of the result
        set_names( url_list ) %>% 
        # paste() and collapse each element
        map(paste, collapse =" ") %>%
        # Remove the 404 
        discard(~.x==404) %>%
        names() # Will return the names of the good ones
}

# Try this function on the urls object
url_tester(urls)
## [1] "https://thinkr.fr"          "https://colinfay.me"       
## [3] "https://datacamp.com"       "http://cran.r-project.org/"
url_tester <- function(url_list, type = c("result", "error")){
    type <- match.arg(type)
    res <- url_list %>%
        # Create a safely() version of read_lines() 
        map( safely(readr::read_lines) ) %>%
        set_names( url_list ) %>%
        # Transpose into a list of $result and $error
        transpose() 
    # Complete this if statement
    if (type == "result") return( res$result ) 
    if (type == "error") return( res$error ) 
}

# Try this function on the urls object
url_tester(urls, type = "error") 
## $`https://thinkr.fr`
## NULL
## 
## $`https://colinfay.me`
## NULL
## 
## $`http://not_working.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working.org>
## 
## $`https://datacamp.com`
## NULL
## 
## $`http://cran.r-project.org/`
## NULL
## 
## $`https://not_working_either.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working_either.org>
url_tester <- function(url_list){
    url_list %>%
        # Map a version of GET() that would otherwise return NULL 
        map( possibly(httr::GET, otherwise=NULL) ) %>%
        # Set the names of the result
        set_names( url_list ) %>%
        # Remove the NULL
        compact() %>%
        # Extract all the "status_code" elements
        map("status_code")
}

# Try this function on the urls object
url_tester(urls)
## $`https://thinkr.fr`
## [1] 200
## 
## $`https://colinfay.me`
## [1] 200
## 
## $`https://datacamp.com`
## [1] 200
## 
## $`http://cran.r-project.org/`
## [1] 200

Chapter 3 - Better Code with purrr

Rationale for cleaner code:

  • Cleaner code is easier to debug (spot typos), easier to interpret, and easier to modify
    • tidy_iris_lm <- compose( as_mapper(~ filter(.x, p.value < 0.05)), tidy, partial(lm, data=iris, na.action = na.fail) )
    • list( Petal.Length ~ Petal.Width, Petal.Width ~ Sepal.Width, Sepal.Width ~ Sepal.Length ) %>% map(tidy_iris_lm)
  • Clean code characteristics
    • Light - no unnecessary code
    • Readable - less repetition makes for easier reading (one piece of code for one task)
    • Interpretable
    • Maintainable
  • The compose() function builds a new function from two or more functions, applied right to left
    • rounded_mean <- compose(round, mean)
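
compose() at its simplest - note that purrr applies the functions right to left:

```r
library(purrr)

# round(mean(x)): mean runs first, then round
rounded_mean <- compose(round, mean)
rounded_mean(c(1.2, 2.7, 3.9))
```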

Building functions with compose() and negate():

  • Any number of functions can be chained together with compose()
    • clean_aov <- compose(tidy, anova, lm)
  • Can use negate() to flip the predicate - TRUE becomes FALSE and FALSE becomes TRUE
    • is_not_na <- negate(is.na)
    • under_hundred <- as_mapper(~ mean(.x) < 100)
    • not_under_hundred <- negate(under_hundred)
    • map_lgl(98:102, under_hundred)
    • map_lgl(98:102, not_under_hundred)
  • The “good” status return codes from GET() are in the low-200s
    • good_status <- c(200, 201, 202, 203)
    • status %in% good_status
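
negate() flips a predicate, and works on infix operators too:

```r
library(purrr)

is_not_na <- negate(is.na)
is_not_na(c(1, NA, 3))

# Negating %in% gives a readable "not in" operator
`%not_in%` <- negate(`%in%`)
good_status <- c(200, 201, 202, 203)
404 %not_in% good_status
```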

Prefilling functions:

  • The partial() function allows for pre-filling a function's arguments
    • mean_na_rm <- partial(mean, na.rm = TRUE)
    • lm_iris <- partial(lm, data = iris)
  • Can also combine partial() and compose()
    • rounded_mean <- compose( partial(round, digits = 2), partial(mean, na.rm = TRUE) )
  • Can use functions from rvest for web scraping
    • read_html()
    • html_nodes()
    • html_text()
    • html_attr()
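
partial() and compose() combined, as in the bullets above:

```r
library(purrr)

# Pre-fill na.rm = TRUE once, instead of at every call site
mean_na_rm <- partial(mean, na.rm = TRUE)
mean_na_rm(c(1, 2, NA, 3))

# Chain the pre-filled pieces: NA-safe mean first, then round to 2 digits
rounded_mean <- compose(partial(round, digits = 2), mean_na_rm)
rounded_mean(c(1, 2, NA, 2.5))
```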

List columns:

  • A list column is part of a nested data frame - one or more of the data frame columns is itself a list (requires use of tibble rather than data.frame)
    • df <- tibble( classic = c("a", "b", "c"), list = list( c("a", "b", "c"), c("a", "b", "c", "d"), c("a", "b", "c", "d", "e") ) )
    • a_node <- partial(html_nodes, css = "a")
    • href <- partial(html_attr, name = "href")
    • get_links <- compose( href, a_node, read_html )
    • urls_df <- tibble( urls = c("https://thinkr.fr", "https://colinfay.me", "https://datacamp.com", "http://cran.r-project.org/") )
    • urls_df %>% mutate(links = map(urls, get_links))
  • Can also unnest the data from the list columns
    • urls_df %>% mutate(links = map(urls, get_links)) %>% unnest()
  • Can also nest() a standard data.frame
    • iris_n <- iris %>% group_by(Species) %>% tidyr::nest()
  • Since the list column is a list, the purrr functions can be run on them
    • iris_n %>% mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
    • summary_lm <- compose(summary, lm)
    • iris %>% group_by(Species) %>% nest() %>% mutate(data = map(data, ~ summary_lm(Sepal.Length ~ Sepal.Width, data = .x)), data = map(data, "r.squared")) %>% unnest()
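
The nest-then-map pattern end to end, using the built-in iris data (one model per Species):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Nest iris by Species, fit one model per group, pull out R-squared
r2_by_species <- iris %>%
    group_by(Species) %>%
    nest() %>%
    mutate(
        fit = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
        r2  = map_dbl(fit, ~ summary(.x)$r.squared)
    )

r2_by_species %>% select(Species, r2)
```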

Example code includes:

urls <- c('https://thinkr.fr', 'https://colinfay.me', 'http://not_working.org', 'https://datacamp.com', 'http://cran.r-project.org/', 'https://not_working_either.org')


# Compose a status extractor (compose is also an igraph function)
status_extract <- purrr::compose(httr::status_code, httr::GET)

# Try with "https://thinkr.fr" & "http://datacamp.com"
status_extract("https://thinkr.fr")
## [1] 200
status_extract("http://datacamp.com")
## [1] 200
# Map it on the urls vector, return a vector of numbers
oldUrls <- urls
urls <- oldUrls[c(1, 2, 4, 5)]
map_dbl(urls, status_extract)
## [1] 200 200 200 200
# Negate the %in% function 
`%not_in%` <- negate(`%in%`)

# Compose a status extractor 
status_extract <- purrr::compose(httr::status_code, httr::GET)

# Complete the function
strict_code <- function(url){
    code <- status_extract(url)
    if (code %not_in% c(200:203)){ return(NA) } else { return(code) } 
}


# Map the strict_code function on the urls vector
res <- map(urls, strict_code)

# Set the names of the results using the urls vector
res_named <- set_names(res, urls)

# Negate the is.na function
is_not_na <- negate(is.na)

# Run is_not_na on the results
is_not_na(res_named)
##          https://thinkr.fr        https://colinfay.me 
##                       TRUE                       TRUE 
##       https://datacamp.com http://cran.r-project.org/ 
##                       TRUE                       TRUE
# Prefill html_nodes() with the css param set to h2
get_h2 <- partial(rvest::html_nodes, css="h2")

# Combine the html_text, get_h2 and read_html functions
get_content <- purrr::compose(rvest::html_text, get_h2, xml2::read_html)

# Map get_content to the urls list
res <- map(urls, get_content) %>%
    set_names(urls)

# Print the results to the console
res
## $`https://thinkr.fr`
##  [1] "Conseil, développement et formation au logiciel R"                                          
##  [2] "Formez-vous au logiciel R !"                                                                
##  [3] "\r\n\t\tPédagogie de la formation au langage R\r\n\t"                                          
##  [4] "\r\n\t\tRetour sur les projets R des étudiants du MSc X-HEC Data Science for Business\r\n\t"   
##  [5] "\r\n\t\tConstruisons la certification R du RConsortium\r\n\t"                                  
##  [6] "\r\n\t\tLes tests statistiques\r\n\t"                                                          
##  [7] "\r\n\t\tÀ la découverte de RStudio Package Manager\r\n\t"                                      
##  [8] "\r\n\t\tLes pièges de la représentation de données\r\n\t"                                      
##  [9] "\r\n\t\tDBI : Distributeur des Brasseurs Indépendants ? Non DataBase Interface\r\n\t"          
## [10] "\r\n\t\tQuoi de neuf {ggplot2} ?\r\n\t"                                                        
## [11] "\r\n\t\tComparaison entre Excel et R : Analyses statistiques et graphiques\r\n\t"              
## 
## $`https://colinfay.me`
## [1] "\n      \n        Watch if R is running from Shiny\n\n      \n    "        
## [2] "\n      \n        An Introduction to Docker for R Users\n\n      \n    "   
## [3] "\n      \n        2018 through {cranlogs}\n\n      \n    "                 
## [4] "\n      \n        Solving #AdventOfCode day 5 and 6 with R\n\n      \n    "
## [5] "\n      \n        Solving #AdventOfCode day 3 and 4 with R\n\n      \n    "
## 
## $`https://datacamp.com`
## character(0)
## 
## $`http://cran.r-project.org/`
## character(0)
# Create a partial version of html_nodes(), with the css param set to "a"
a_node <- partial(rvest::html_nodes, css="a")

# Create href(), a partial version of html_attr()
href <- partial(rvest::html_attr, name = "href")

# Combine href(), a_node(), and read_html()
get_links <- purrr::compose(href, a_node, xml2::read_html)

# Map get_links() to the urls list
res <- map(urls, get_links) %>%
    set_names(urls)


df <- tibble::tibble(urls=urls)
df
## # A tibble: 4 x 1
##   urls                      
##   <chr>                     
## 1 https://thinkr.fr         
## 2 https://colinfay.me       
## 3 https://datacamp.com      
## 4 http://cran.r-project.org/
# Create a "links" columns, by mapping get_links() on urls
df2 <- df %>%
    mutate(links = map(urls, get_links)) 

# Print df2 to see what it looks like
df2
## # A tibble: 4 x 2
##   urls                       links      
##   <chr>                      <list>     
## 1 https://thinkr.fr          <chr [147]>
## 2 https://colinfay.me        <chr [33]> 
## 3 https://datacamp.com       <chr [92]> 
## 4 http://cran.r-project.org/ <chr [1]>
# unnest() df2 to have a tidy dataframe
df2 %>%
    tidyr::unnest()
## # A tibble: 273 x 2
##    urls            links                                                  
##    <chr>           <chr>                                                  
##  1 https://thinkr~ https://thinkr.fr/                                     
##  2 https://thinkr~ https://thinkr.fr/                                     
##  3 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/             
##  4 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/introduction~
##  5 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/statistique-~
##  6 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/programmatio~
##  7 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/r-et-le-big-~
##  8 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/r-pour-la-fi~
##  9 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/integration-~
## 10 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/formation-si~
## # ... with 263 more rows

Chapter 4 - Case Study

Discovering the Dataset:

  • The dataset is available from https://github.com/ThinkR-open/datasets
    • rstudioconf: a list of 5055 tweets
    • length(rstudioconf)
    • length(rstudioconf[[1]])
    • purrr::vec_depth(rstudioconf)
  • JSON is a standard data format for the web, and typically consists of key-value pairs which are read as nested lists by R
  • Refresher of keep() and discard() usage
    • keep(1:10, ~ .x < 5)
    • discard(1:10, ~ .x < 5)
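
The keep()/discard() refresher above can be run as-is; a quick sketch of what each call returns (assuming purrr is loaded):

```r
library(purrr)

# keep() retains elements where the predicate returns TRUE
keep(1:10, ~ .x < 5)     # 1 2 3 4

# discard() drops elements where the predicate returns TRUE
discard(1:10, ~ .x < 5)  # 5 6 7 8 9 10
```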

Extracting Information from the Dataset:

  • Can manipulate functions for list cleaning using higher-order functions such as partial() and compose()
    • sum_no_na <- partial(sum, na.rm = TRUE)
    • map_dbl(airquality, sum_no_na)
    • rounded_sum <- compose(round, sum_no_na)
    • map_dbl(airquality, rounded_sum)
  • Can also clean lists using compact() to remove NULL and flatten() to remove one level from a nested list
    • l <- list(NULL, 1, 2, 3, NULL)
    • compact(l)
    • my_list <- list( list(a = 1), list(b = 2) )
    • flatten(my_list)
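
One detail worth flagging about compose(): it applies its functions right-to-left, so the right-most function runs first. A minimal sketch of the partial()/compose() bullets above (assuming purrr is loaded):

```r
library(purrr)

# Pre-fill sum() so it ignores NA values
sum_no_na <- partial(sum, na.rm = TRUE)

# compose() applies right-to-left: sum_no_na() runs first, then round()
rounded_sum <- compose(round, sum_no_na)

rounded_sum(c(1.234, 2.345, NA))  # sum = 3.579, rounds to 4
```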

Manipulating URL:

  • Can use as_mapper() to create a reusable mapper function
    • mult <- as_mapper(~ .x * 2)
  • Can use str_detect() inside the mapper function
    • lyrics <- c("Is this the real life?", "Is this just fantasy?", "Caught in a landslide", "No escape from reality")
    • stringr::str_detect(lyrics, "life")
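
Putting the mapper and str_detect() ideas above together into one runnable sketch (assumes purrr and stringr are loaded):

```r
library(purrr)
library(stringr)

lyrics <- c("Is this the real life?", "Is this just fantasy?",
            "Caught in a landslide", "No escape from reality")

# A reusable mapper that tests each string for the pattern "life"
has_life <- as_mapper(~ str_detect(.x, "life"))

map_lgl(lyrics, has_life)  # TRUE FALSE FALSE FALSE
```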

Identifying Influencers:

  • Can use the map_at() function to run a function at a specific portion of the list
    • my_list <- list( a = 1:10, b = 1:100, c = 12 )
    • map_at(.x = my_list, .at = "b", .f = sum)
  • Can also use negate() to reverse the action of a predicate
    • not_character <- negate(is.character)
    • my_list <- list( a = 1:10, b = "a", c = iris )
    • map(my_list, not_character)
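
Combining the map_at() and negate() bullets above into one runnable sketch (assuming purrr is loaded; iris ships with base R):

```r
library(purrr)

my_list <- list(a = 1:10, b = 1:100, c = 12)

# Run sum() only at element "b"; the other elements pass through untouched
map_at(my_list, .at = "b", .f = sum)
# -> list(a = 1:10, b = 5050, c = 12)

# negate() flips a predicate function
not_character <- negate(is.character)
map_lgl(list(a = 1:10, b = "a", c = iris), not_character)
# -> TRUE FALSE TRUE (iris is a data frame, not a character vector)
```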

Wrap up:

  • Lambda functions and reusable mappers
    • map(1:5, ~ .x*10)
    • ten_times <- as_mapper(~ .x * 10)
    • map(1:5, ten_times)
  • Function manipulation using functionals (functions that take functions as inputs and return vectors)
    • map() & friends
    • keep() & discard()
    • some() & every()
  • Function operators take functions and return (modified) functions
    • safely() & possibly()
    • partial()
    • compose()
    • negate()
  • Cleaner code is easier to read, understand, and maintain
    • rounded_mean <- compose( partial(round, digits = 1), partial(mean, trim = 2, na.rm = TRUE) )
    • map( list(airquality, mtcars), ~ map_dbl(.x, rounded_mean) )

Example code includes:

rstudioconfDF <- readRDS("./RInputFiles/#RStudioConf.RDS")
dim(rstudioconfDF)
## [1] 5055   42
rstudioconf <- as.list(as.data.frame(t(rstudioconfDF)))
length(rstudioconf)
## [1] 5055
length(rstudioconf[[1]])
## [1] 42
# Print the first element of the list to the console 
rstudioconf[[1]]
## $status_id
## [1] "960732355773239296"
## 
## $created_at
## [1] 1517891417
## 
## $user_id
## [1] "626266741"
## 
## $screen_name
## [1] "grod_rf"
## 
## $text
## [1] "RT @dataandme: <U+0001F41C> Check it, @ajmcoqui's \"Debugging in RStudio\" \n<U+0001F4FD> Slides *and* cheat sheet!\nhttps://t.co/rAvKP9iXLa #rstats #rstudioconf htt…"
## 
## $source
## [1] "Twitter for Android"
## 
## $reply_to_status_id
## [1] NA
## 
## $reply_to_user_id
## [1] NA
## 
## $reply_to_screen_name
## [1] NA
## 
## $is_quote
## [1] FALSE
## 
## $is_retweet
## [1] TRUE
## 
## $favorite_count
## [1] 0
## 
## $retweet_count
## [1] 7
## 
## $hashtags
## [1] "rstats"      "rstudioconf"
## 
## $symbols
## [1] NA
## 
## $urls_url
## [1] "buff.ly/2s7W8ED"
## 
## $urls_t.co
## [1] "https://t.co/rAvKP9iXLa"
## 
## $urls_expanded_url
## [1] "https://buff.ly/2s7W8ED"
## 
## $media_url
## [1] NA
## 
## $media_t.co
## [1] NA
## 
## $media_expanded_url
## [1] NA
## 
## $media_type
## [1] NA
## 
## $ext_media_url
## [1] NA
## 
## $ext_media_t.co
## [1] NA
## 
## $ext_media_expanded_url
## [1] NA
## 
## $ext_media_type
## [1] NA
## 
## $mentions_user_id
## [1] "3230388598"         "732925397814247426"
## 
## $mentions_screen_name
## [1] "dataandme" "ajmcoqui" 
## 
## $lang
## [1] "en"
## 
## $quoted_status_id
## [1] NA
## 
## $quoted_text
## [1] NA
## 
## $retweet_status_id
## [1] "960600422556880896"
## 
## $retweet_text
## [1] "<U+0001F41C> Check it, @ajmcoqui's \"Debugging in RStudio\" \n<U+0001F4FD> Slides *and* cheat sheet!\nhttps://t.co/rAvKP9iXLa #rstats #rstudioconf https://t.co/T4627GcuXK"
## 
## $place_url
## [1] NA
## 
## $place_name
## [1] NA
## 
## $place_full_name
## [1] NA
## 
## $place_type
## [1] NA
## 
## $country
## [1] NA
## 
## $country_code
## [1] NA
## 
## $geo_coords
## [1] NA NA
## 
## $coords_coords
## [1] NA NA
## 
## $bbox_coords
## [1] NA NA NA NA NA NA NA NA
# Create a sublist of non-retweets
non_rt <- discard(rstudioconf, "is_retweet")

# Extract the favorite count element of each non_rt sublist
fav_count <- map_dbl(non_rt, "favorite_count")

# Get the median of favorite_count for non_rt
median(fav_count)
## [1] 1
# Keep the RT, extract the user_id, remove the duplicate
rt <- keep(rstudioconf, "is_retweet") %>%
    map("user_id") %>% 
    unique()

# Remove the RT, extract the user id, remove the duplicate
non_rt <- discard(rstudioconf, "is_retweet") %>%
    map("user_id") %>% 
    unique()

# Determine the total number of users
union(rt, non_rt) %>% length()
## [1] 1742
# Determine the number of users who have only retweeted
setdiff(rt, non_rt) %>% length()
## [1] 1302
# Prefill mean() with na.rm, and round() with digits = 1
mean_na_rm <- partial(mean, na.rm=TRUE)
round_one <- partial(round, digits=1)

# Compose a rounded_mean function
rounded_mean <- purrr::compose(round_one, mean_na_rm)

# Extract the non-retweets
non_rt <- discard(rstudioconf, "is_retweet")

# Extract "favorite_count", and pass it to rounded_mean()
map_dbl(non_rt, "favorite_count") %>%
    rounded_mean()
## [1] 3.3
# Combine as_vector(), compact(), and flatten()
flatten_to_vector <- purrr::compose(as_vector, compact, flatten)

# Complete the function
extractor <- function(list, what = "mentions_screen_name"){
    map(list, what) %>%
        flatten_to_vector()
}

# Create six_most, with tail(), sort(), and table()
six_most <- purrr::compose(tail, sort, table)

# Run extractor() on rstudioconf
extractor(rstudioconf) %>% 
    six_most()
## .
##    JennyBryan hadleywickham      AmeliaMN    juliasilge          drob 
##           278           308           362           376           418 
##       rstudio 
##           648
# Extract the "urls_url" elements, and flatten() the result
urls_clean <- map(rstudioconf, "urls_url") %>%
    flatten()

# Remove the NULL
compact_urls <- compact(urls_clean)
compact_urls <- discard(urls_clean, is.na)  # Due to creation of the list above, NULL became NA

# Create a mapper that detects the pattern "github"
has_github <- as_mapper(~ str_detect(.x, "github"))

# Look for the "github" pattern, and sum the result
map_lgl( compact_urls, has_github ) %>%
    sum()
## [1] 347
# Complete the function
ratio_pattern <- function(vec, pattern){
    n_pattern <- str_detect(vec, pattern) %>% sum()
    n_pattern / length(vec)
}

# Create flatten_and_compact()
extraDiscard <- function(x) { discard(x, is.na) }  # address same NA issue as above
flatten_and_compact <- purrr::compose(compact, extraDiscard, flatten)

# Complete the pipe to get the ratio of URLs with "github"
map(rstudioconf, "urls_url") %>%
    flatten_and_compact() %>% 
    ratio_pattern("github")
## [1] 0.2943172
# Create mean_above, a mapper that tests if .x is over 3.3
mean_above <- as_mapper(~ .x > 3.3)

# Prefil map_at() with "retweet_count", mean_above for above, 
# and mean_above negation for below
above <- partial(map_at, .at = "retweet_count", .f = mean_above )
below <- partial(map_at, .at = "retweet_count", .f = negate(mean_above) )

# Map above() and below() on non_rt, keep the "retweet_count"
# ab <- map(non_rt, above) %>% keep("retweet_count")
# bl <- map(non_rt, below) %>% keep("retweet_count")

# Compare the size of both elements
# length(ab)
# length(bl)


# Get the max() of "retweet_count" 
max_rt <- map_dbl(non_rt, "retweet_count") %>% 
    max()

# Prefill map_at() with a mapper testing if .x equals max_rt
# max_rt_calc <- partial(map_at, .at = "retweet_count", .f = ~.x==max_rt )

# Map max_rt_calc on non_rt, keep the retweet_count & flatten
# res <- map(non_rt, max_rt_calc) %>% 
#     keep("retweet_count") %>% 
#     flatten()

# Print the "screen_name" and "text" of the result
# res$screen_name
# res$text

Foundations of Functional Programming with purrr

Chapter 1 - Simplifying Iteration and Lists with purrr

The power of iteration:

  • Iteration is the process of repeating commands over elements of a dataset
    • d <- list()
    • for(i in 1:10){ d[[i]] <- read_csv(files[i]) }
  • Typos are a risk when using for loops; purrr simplifies iteration by reducing everything to the map_*() family of functions
    • map(.x=object, .f=function)
    • d <- map(files, read_csv)
  • Can use the bird_counts data as an example
    • bird_sum <- map(bird_counts, sum)

Subsetting lists:

  • Lists can store multiple data types - data frames, models, numbers, text, whatever
  • Indexing a data frame
    • mtcars[1, "wt"]
    • mtcars$wt
  • List indexing differs from data frames and vectors
    • lo[[2]] # pull the second element
    • lo[["model"]] # pull by name
  • Calculate something on each element with purrr
    • map(survey_data, ~nrow(.x)) # the ~ marks an anonymous (lambda) function
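
The difference between single- and double-bracket subsetting on lists is easy to trip over, so a minimal base-R sketch (lo here is a hypothetical list):

```r
# A hypothetical list mixing text and numbers
lo <- list(model = "lm", coefs = c(1.5, -0.2))

lo[2]          # single brackets return a sub-list of length 1
lo[[2]]        # double brackets return the element itself
lo[["model"]]  # pull by name -> "lm"
```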

Multiple flavors of map():

  • May want to output data in a different class, and the map_*() help with this
    • map_dbl(survey_data, ~nrow(.x)) # return as a vector of doubles
    • map_lgl(survey_data, ~nrow(.x)==14) # return as logical vector (boolean)
    • map_chr(species_names, ~.x) # return as character vector
    • survey_rows <- data.frame(names = names(survey_data), rows = NA)
    • survey_rows$rows <- map_dbl(survey_data, ~nrow(.x))

Example code includes:

# Build the list of 16 input file paths (1990 through 2005)
files <- map(1990:2005, ~ paste0("./RInputFiles/data_from_", .x, ".csv"))


# Initialize list
all_files <- list()

# For loop to read files into a list
for(i in seq_along(files)){
  all_files[[i]] <- readr::read_csv(file = files[[i]])
}
## Parsed with column specification:
## cols(
##   years = col_double(),
##   a = col_double(),
##   b = col_double()
## )
## (identical message repeated while reading the remaining 15 files)
# Output size of list object
length(all_files)
## [1] 16
# Use map to iterate
all_files_purrr <- purrr::map(files, read_csv) 
## Parsed with column specification:
## cols(
##   years = col_double(),
##   a = col_double(),
##   b = col_double()
## )
## (identical message repeated while reading the remaining 15 files)
# Output size of list object
length(all_files_purrr)
## [1] 16
temp <- c("1", "2", "3", "4")
list_of_df <- rep(list(temp), 10)


# Check the class type of the first element
class(list_of_df[[1]])
## [1] "character"
# Change each element from a character to a number
for(i in seq_along(list_of_df)){
    list_of_df[[i]] <- as.numeric(list_of_df[[i]])
}

# Check the class type of the first element
class(list_of_df[[1]])
## [1] "numeric"
# Print out the list
list_of_df
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1] 1 2 3 4
## 
## [[4]]
## [1] 1 2 3 4
## 
## [[5]]
## [1] 1 2 3 4
## 
## [[6]]
## [1] 1 2 3 4
## 
## [[7]]
## [1] 1 2 3 4
## 
## [[8]]
## [1] 1 2 3 4
## 
## [[9]]
## [1] 1 2 3 4
## 
## [[10]]
## [1] 1 2 3 4
temp <- c("1", "2", "3", "4")
list_of_df <- rep(list(temp), 10)


# Check the class type of the first element
class(list_of_df[[1]])  
## [1] "character"
# Change each character element to a number
list_of_df <- map(list_of_df, as.numeric)

# Check the class type of the first element again
class(list_of_df[[1]])
## [1] "numeric"
# Print out the list
list_of_df
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1] 1 2 3 4
## 
## [[4]]
## [1] 1 2 3 4
## 
## [[5]]
## [1] 1 2 3 4
## 
## [[6]]
## [1] 1 2 3 4
## 
## [[7]]
## [1] 1 2 3 4
## 
## [[8]]
## [1] 1 2 3 4
## 
## [[9]]
## [1] 1 2 3 4
## 
## [[10]]
## [1] 1 2 3 4
# Load wesanderson dataset
data(wesanderson, package="repurrrsive")

# Get structure of first element in wesanderson
str(wesanderson[[1]])
##  chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
# Get structure of GrandBudapest element in wesanderson
str(wesanderson$GrandBudapest)
##  chr [1:4] "#F1BB7B" "#FD6467" "#5B1A18" "#D67236"
# Third element of the first wesanderson vector
wesanderson[[1]][3]
## [1] "#5B1A18"
# Fourth element of the GrandBudapest wesanderson vector
wesanderson$GrandBudapest[4]
## [1] "#D67236"
data(sw_films, package="repurrrsive")


# Subset the first element of the sw_films data
sw_films[[1]]
## $title
## [1] "A New Hope"
## 
## $episode_id
## [1] 4
## 
## $opening_crawl
## [1] "It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy...."
## 
## $director
## [1] "George Lucas"
## 
## $producer
## [1] "Gary Kurtz, Rick McCallum"
## 
## $release_date
## [1] "1977-05-25"
## 
## $characters
##  [1] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/2/" 
##  [3] "http://swapi.co/api/people/3/"  "http://swapi.co/api/people/4/" 
##  [5] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/6/" 
##  [7] "http://swapi.co/api/people/7/"  "http://swapi.co/api/people/8/" 
##  [9] "http://swapi.co/api/people/9/"  "http://swapi.co/api/people/10/"
## [11] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/13/"
## [13] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
## [15] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [17] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/81/"
## 
## $planets
## [1] "http://swapi.co/api/planets/2/" "http://swapi.co/api/planets/3/"
## [3] "http://swapi.co/api/planets/1/"
## 
## $starships
## [1] "http://swapi.co/api/starships/2/"  "http://swapi.co/api/starships/3/" 
## [3] "http://swapi.co/api/starships/5/"  "http://swapi.co/api/starships/9/" 
## [5] "http://swapi.co/api/starships/10/" "http://swapi.co/api/starships/11/"
## [7] "http://swapi.co/api/starships/12/" "http://swapi.co/api/starships/13/"
## 
## $vehicles
## [1] "http://swapi.co/api/vehicles/4/" "http://swapi.co/api/vehicles/6/"
## [3] "http://swapi.co/api/vehicles/7/" "http://swapi.co/api/vehicles/8/"
## 
## $species
## [1] "http://swapi.co/api/species/5/" "http://swapi.co/api/species/3/"
## [3] "http://swapi.co/api/species/2/" "http://swapi.co/api/species/1/"
## [5] "http://swapi.co/api/species/4/"
## 
## $created
## [1] "2014-12-10T14:23:31.880000Z"
## 
## $edited
## [1] "2015-04-11T09:46:52.774897Z"
## 
## $url
## [1] "http://swapi.co/api/films/1/"
# Subset the first element of the sw_films data, title column 
sw_films[[1]]$title
## [1] "A New Hope"
# Map over wesanderson to get the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
## 
## $Moonrise1
## [1] 4
## 
## $Royal1
## [1] 4
## 
## $Moonrise2
## [1] 4
## 
## $Cavalcanti
## [1] 5
## 
## $Royal2
## [1] 5
## 
## $GrandBudapest2
## [1] 4
## 
## $Moonrise3
## [1] 5
## 
## $Chevalier
## [1] 4
## 
## $Zissou
## [1] 5
## 
## $FantasticFox
## [1] 5
## 
## $Darjeeling
## [1] 5
## 
## $Rushmore
## [1] 5
## 
## $BottleRocket
## [1] 7
## 
## $Darjeeling2
## [1] 5
# Map over wesanderson, and determine the length of each element
map(wesanderson, ~length(.x))
## $GrandBudapest
## [1] 4
## 
## $Moonrise1
## [1] 4
## 
## $Royal1
## [1] 4
## 
## $Moonrise2
## [1] 4
## 
## $Cavalcanti
## [1] 5
## 
## $Royal2
## [1] 5
## 
## $GrandBudapest2
## [1] 4
## 
## $Moonrise3
## [1] 5
## 
## $Chevalier
## [1] 4
## 
## $Zissou
## [1] 5
## 
## $FantasticFox
## [1] 5
## 
## $Darjeeling
## [1] 5
## 
## $Rushmore
## [1] 5
## 
## $BottleRocket
## [1] 7
## 
## $Darjeeling2
## [1] 5
# Map over wesanderson and determine the length of each element
map(wesanderson, length)
## $GrandBudapest
## [1] 4
## 
## $Moonrise1
## [1] 4
## 
## $Royal1
## [1] 4
## 
## $Moonrise2
## [1] 4
## 
## $Cavalcanti
## [1] 5
## 
## $Royal2
## [1] 5
## 
## $GrandBudapest2
## [1] 4
## 
## $Moonrise3
## [1] 5
## 
## $Chevalier
## [1] 4
## 
## $Zissou
## [1] 5
## 
## $FantasticFox
## [1] 5
## 
## $Darjeeling
## [1] 5
## 
## $Rushmore
## [1] 5
## 
## $BottleRocket
## [1] 7
## 
## $Darjeeling2
## [1] 5
# Create a numcolors column and fill with length of each wesanderson element
data.frame(numcolors = map_dbl(wesanderson, ~length(.x)))
##                numcolors
## GrandBudapest          4
## Moonrise1              4
## Royal1                 4
## Moonrise2              4
## Cavalcanti             5
## Royal2                 5
## GrandBudapest2         4
## Moonrise3              5
## Chevalier              4
## Zissou                 5
## FantasticFox           5
## Darjeeling             5
## Rushmore               5
## BottleRocket           7
## Darjeeling2            5

Chapter 2 - More Complex Iterations

Working with unnamed lists:

  • The purrr package works well with the piping (%>%) operator
    • names(survey_data)
    • survey_data %>% names()
    • sw_films <- sw_films %>% set_names(map_chr(sw_films, "title")) # turn sw_films into a named list, keyed by the title of each movie in the list (sw_films is the Star Wars films data from repurrrsive)
    • map(waterfowl_data, ~.x %>% sum() %>% log())

More map():

  • Can use map() to simulate data, run models, test models, etc.
    • list_of_df <- map(list_of_means, ~data.frame(a = rnorm(mean = .x, n = 200, sd = (5/2))))
    • models <- education_data %>% map(~ lm(income ~ education_level, data = .x)) %>% map(summary)
    • map(livingthings, ~.x[["species"]])
  • There are many flavors of map_*() which can be applied
    • map_lgl() - logical vector output
    • map_dbl() - double vector output
    • map_int() - integer vector output
    • map_chr() - character vector output
    • map_dbl(bird_measurements, ~.x[["wing length"]])
  • Can use map_df() to build a data frame, pairing it with the data_frame() function (now superseded by tibble())
    • bird_measurements %>% map_df(~ data_frame(weight = .x[["weight"]], wing_length = .x[["wing length"]]))
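
Since data_frame() has been superseded, the same map_df() pattern works with tibble(); a sketch using a made-up two-element list standing in for bird_measurements:

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for bird_measurements: one sub-list per bird
bird_measurements <- list(
  list(weight = 12.1, `wing length` = 5.3),
  list(weight = 15.4, `wing length` = 6.1)
)

# map_df() row-binds the per-element tibbles into one data frame
bird_measurements %>%
    map_df(~ tibble(weight = .x[["weight"]], wing_length = .x[["wing length"]]))
# -> a 2-row tibble with columns weight and wing_length
```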

map2() and pmap():

  • The map2() function allows for pulling out information from two inputs - .x is list 1 and .y is list 2
    • simdata <- map2(list_of_means, list_of_sd, ~data.frame(a = rnorm(mean=.x, n=200, sd=.y), b = rnorm(mean=200, n=200, sd=15)))
  • Can also use the pmap() function for as many lists as we want to use (takes a list of lists as an input)
    • input_list <- list( means = list_of_means, sd = list_of_sd, samplesize = list_of_samplesize)
    • simdata <- pmap(input_list, function(means, sd, samplesize) data.frame(a = rnorm(mean = means, n = samplesize, sd = sd)))

Example code includes:

# Use pipes to check for names in sw_films
sw_films %>%
    names()
## NULL
# Set names so each element of the list is named for the film title
sw_films_named <- sw_films %>% 
  set_names(map_chr(., "title"))

# Check to see if the names worked/are correct
names(sw_films_named)
## [1] "A New Hope"              "Attack of the Clones"   
## [3] "The Phantom Menace"      "Revenge of the Sith"    
## [5] "Return of the Jedi"      "The Empire Strikes Back"
## [7] "The Force Awakens"
# Create a list of values from 1 through 10
numlist <- list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Iterate over the numlist 
map(numlist, ~.x %>% sqrt() %>% sin())
## [[1]]
## [1] 0.841471
## 
## [[2]]
## [1] 0.9877659
## 
## [[3]]
## [1] 0.9870266
## 
## [[4]]
## [1] 0.9092974
## 
## [[5]]
## [1] 0.7867491
## 
## [[6]]
## [1] 0.6381576
## 
## [[7]]
## [1] 0.4757718
## 
## [[8]]
## [1] 0.3080717
## 
## [[9]]
## [1] 0.14112
## 
## [[10]]
## [1] -0.02068353
# List of sites north, east, and west
sites <- list("north", "east", "west")

# Create a list of dataframes, each with a years, a, and b column 
list_of_df <-  map(sites,  
  ~data.frame(years = .x,
       a = rnorm(mean = 5, n = 200, sd = 5/2),
       b = rnorm(mean = 200, n = 200, sd = 15)))

map(list_of_df, head)
## [[1]]
##   years        a        b
## 1 north 5.419245 185.6379
## 2 north 3.497293 191.3937
## 3 north 7.542372 203.1696
## 4 north 1.237095 200.3886
## 5 north 7.965691 202.3536
## 6 north 5.997626 198.6213
## 
## [[2]]
##   years        a        b
## 1  east 3.959230 210.2183
## 2  east 7.794994 202.4002
## 3  east 1.787426 215.2296
## 4  east 7.324983 211.6860
## 5  east 3.200095 167.6267
## 6  east 1.898869 176.9505
## 
## [[3]]
##   years        a        b
## 1  west 1.650724 200.3956
## 2  west 5.358391 196.1960
## 3  west 6.100584 193.0619
## 4  west 3.674544 226.0657
## 5  west 7.548899 183.7324
## 6  west 7.492224 218.4433
# Map over the models to look at the relationship of a vs b
list_of_df %>%
    map(~ lm(a ~ b, data = .)) %>%
    map(~summary(.))
## [[1]]
## 
## Call:
## lm(formula = a ~ b, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9588 -1.7913  0.2551  1.9562  8.1414 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.624601   2.752832   2.043   0.0424 *
## b           -0.003813   0.013789  -0.277   0.7824  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.763 on 198 degrees of freedom
## Multiple R-squared:  0.0003861,  Adjusted R-squared:  -0.004662 
## F-statistic: 0.07647 on 1 and 198 DF,  p-value: 0.7824
## 
## 
## [[2]]
## 
## Call:
## lm(formula = a ~ b, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3600 -1.8732 -0.3591  2.0309  7.6241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.573714   2.427035   2.297   0.0227 *
## b           -0.003159   0.012032  -0.263   0.7931  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.722 on 198 degrees of freedom
## Multiple R-squared:  0.0003481,  Adjusted R-squared:  -0.004701 
## F-statistic: 0.06895 on 1 and 198 DF,  p-value: 0.7931
## 
## 
## [[3]]
## 
## Call:
## lm(formula = a ~ b, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1929 -1.6531 -0.0416  1.3332  7.8433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.28039    2.38101   3.898 0.000133 ***
## b           -0.02109    0.01185  -1.781 0.076507 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.553 on 198 degrees of freedom
## Multiple R-squared:  0.01576,    Adjusted R-squared:  0.01079 
## F-statistic: 3.171 on 1 and 198 DF,  p-value: 0.07651
# Pull out the director element of sw_films in a list and character vector
map(sw_films, ~.x[["director"]])
## [[1]]
## [1] "George Lucas"
## 
## [[2]]
## [1] "George Lucas"
## 
## [[3]]
## [1] "George Lucas"
## 
## [[4]]
## [1] "George Lucas"
## 
## [[5]]
## [1] "Richard Marquand"
## 
## [[6]]
## [1] "Irvin Kershner"
## 
## [[7]]
## [1] "J. J. Abrams"
map_chr(sw_films, ~.x[["director"]])
## [1] "George Lucas"     "George Lucas"     "George Lucas"    
## [4] "George Lucas"     "Richard Marquand" "Irvin Kershner"  
## [7] "J. J. Abrams"
# Compare outputs when checking if director is George Lucas
map(sw_films, ~.x[["director"]] == "George Lucas")
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] FALSE
## 
## [[6]]
## [1] FALSE
## 
## [[7]]
## [1] FALSE
map_lgl(sw_films, ~.x[["director"]] == "George Lucas")
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# Pull out episode_id element as list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 6
## 
## [[6]]
## [1] 5
## 
## [[7]]
## [1] 7
# Pull out episode_id element as double vector
map_dbl(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
# Pull out episode_id element as list
map(sw_films, ~.x[["episode_id"]])
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 6
## 
## [[6]]
## [1] 5
## 
## [[7]]
## [1] 7
# Pull out episode_id element as integer vector
map_int(sw_films, ~.x[["episode_id"]])
## [1] 4 2 1 3 6 5 7
# List of 1 through 3
means <- list(1, 2, 3)

# Create sites list
sites <- list("north", "west", "east")

# Map over two arguments: sites and means
list_of_files_map2 <- map2(sites, means, ~data.frame(sites = .x,
                           a = rnorm(mean = .y, n = 200, sd = (5/2))))

map(list_of_files_map2, head)
## [[1]]
##   sites          a
## 1 north  0.5819726
## 2 north -0.5746658
## 3 north  0.1643412
## 4 north -2.1005153
## 5 north -2.2215248
## 6 north -3.4375990
## 
## [[2]]
##   sites          a
## 1  west  1.1073218
## 2  west -1.0274293
## 3  west  3.6507615
## 4  west  0.6089474
## 5  west -1.8287453
## 6  west  3.3056101
## 
## [[3]]
##   sites         a
## 1  east 9.5459135
## 2  east 2.0063471
## 3  east 4.9414991
## 4  east 4.9324294
## 5  east 2.7412434
## 6  east 0.4964416
means <- list(1, 2, 3)
sigma <- list(1, 2, 3)
means2 <- list(0.5, 1, 1.5)
sigma2 <- list(0.5, 1, 1.5)


# Create a master list, a list of lists
pmapinputs <- list(sites = sites,  means1 = means, sigma1 = sigma, 
                   means2 = means2, sigma2 = sigma2)

# Map over the master list
list_of_files_pmap <- pmap(pmapinputs, 
                           function(sites, means1, sigma1, means2, sigma2) {
                               data.frame(years = sites, 
                                          a = rnorm(mean = means1, n = 200, sd = sigma1), 
                                          b = rnorm(mean = means2, n = 200, sd = sigma2)
                                          )
                               }
                           )
                           
map(list_of_files_pmap, head)
## [[1]]
##   years          a          b
## 1 north  0.4063538 0.09759733
## 2 north  2.0099744 0.11524764
## 3 north -0.8131808 0.62864546
## 4 north -0.1670964 0.71861601
## 5 north  0.7262024 0.32455889
## 6 north  0.6275768 0.73348169
## 
## [[2]]
##   years          a         b
## 1  west  2.5286327 1.7629410
## 2  west  1.9901082 0.6423823
## 3  west  4.0098040 0.2195289
## 4  west -0.4119218 0.6716433
## 5  west  2.0980822 2.1730813
## 6  west  3.8050698 0.9769756
## 
## [[3]]
##   years          a          b
## 1  east  2.9726321 -0.9428948
## 2  east  0.2710128  0.5352085
## 3  east  0.3507006  1.1646529
## 4  east  3.0210231  1.7266897
## 5  east -3.0043878  2.2508411
## 6  east  1.3111049  1.5101823

Chapter 3 - Troubleshooting Lists with purrr

How to purrr safely():

  • There can be QA/QC challenges if elements of a large list are not as expected
  • The map_*() functions only work if the list is as expected
  • Can use the safely() function to find out where the issues occur, keep running, and supply a default output via the otherwise= argument
    • a <- list("unknown", 10) %>% map(safely(function(x) x * 10, otherwise = NA_real_))
    • Each element then has both a result element and an error element
  • Can pipe to the transpose() function to make it easier to investigate the error messages
    • a <- list("unknown", 10) %>% map(safely(function(x) x * 10, otherwise = NA_real_)) %>% transpose()

Another way to possibly() purrr:

  • The possibly() function gets past errors in the input list by returning only the value (or the otherwise= default)
  • A typical workflow is to diagnose the issues using safely(), then handle them using possibly()
    • a <- list(-10, "unknown", 10) %>% map(safely(function(x) x * 10, otherwise = NA_real_))
    • a <- list(-10, "unknown", 10) %>% map(possibly(function(x) x * 10, otherwise = NA_real_))
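To see the difference between the two adverbs side by side, here is a minimal sketch (the helper names safe_times10 and poss_times10 are made up for illustration): safely() wraps every call's output in a two-element list, while possibly() returns a bare value.

```r
library(purrr)

# safely() wraps each call's output in a list with $result and $error;
# possibly() returns just the value, or the otherwise= default on error
safe_times10 <- safely(function(x) x * 10, otherwise = NA_real_)
poss_times10 <- possibly(function(x) x * 10, otherwise = NA_real_)

safe_out <- map(list("unknown", 10), safe_times10)
poss_out <- map(list("unknown", 10), poss_times10)

safe_out[[1]]$result  # NA (the otherwise= default)
safe_out[[1]]$error   # the captured error condition
poss_out[[1]]         # NA, with no error bookkeeping
```

This is why safely() suits diagnosis (the error conditions are preserved) and possibly() suits production pipelines (the output stays flat).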

purrr is a walk() in the park:

  • The walk() function helps create human-readable results in a compact manner
    • short_list <- list(-10, 1, 10)
    • walk(short_list, print)
  • The walk() function takes both a .x and a .f argument
    • walk(short_list, print) # prints each element but returns .x invisibly, so no return value clutters the console
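Because walk() returns its .x input invisibly, it can sit in the middle of a pipe. A minimal sketch (short_list as in the bullets above):

```r
library(purrr)

short_list <- list(-10, 1, 10)

# walk() calls .f for its side effects (printing here) and returns
# .x invisibly, so the pipeline can continue with the original list
res <- short_list %>%
    walk(print) %>%
    map_dbl(abs)

res
## [1] 10  1 10
```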

Example code includes:

# Map safely over log
a <- list(-10, 1, 10, 0) %>% 
    map(safely(log, otherwise = NA_real_)) %>%
    # Transpose the result
    transpose()
## Warning in .f(...): NaNs produced
# Print the list
a
## $result
## $result[[1]]
## [1] NaN
## 
## $result[[2]]
## [1] 0
## 
## $result[[3]]
## [1] 2.302585
## 
## $result[[4]]
## [1] -Inf
## 
## 
## $error
## $error[[1]]
## NULL
## 
## $error[[2]]
## NULL
## 
## $error[[3]]
## NULL
## 
## $error[[4]]
## NULL
# Print the result element in the list
a[["result"]]
## [[1]]
## [1] NaN
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 2.302585
## 
## [[4]]
## [1] -Inf
# Print the error element in the list
a[["error"]]
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
# Load sw_people data
data(sw_people, package="repurrrsive")


# Map over sw_people and pull out the height element
height_cm <- map(sw_people, "height") %>%
    map(function(x) { ifelse(x == "unknown", NA, as.numeric(x)) })


# Map over sw_people and pull out the height element
height_ft <- map(sw_people , "height") %>% 
    map(safely(function(x){ ifelse(x == "unknown", NA, as.numeric(x) * 0.0328084) }, quiet = FALSE)) %>%
    transpose()

# Print your list, the result element, and the error element
walk(height_ft, function(x) { print(x[1:10]) })
## [[1]]
## [1] 5.643045
## 
## [[2]]
## [1] 5.479003
## 
## [[3]]
## [1] 3.149606
## 
## [[4]]
## [1] 6.627297
## 
## [[5]]
## [1] 4.92126
## 
## [[6]]
## [1] 5.839895
## 
## [[7]]
## [1] 5.413386
## 
## [[8]]
## [1] 3.182415
## 
## [[9]]
## [1] 6.003937
## 
## [[10]]
## [1] 5.971129
## 
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
height_ft[["result"]][1:10]
## [[1]]
## [1] 5.643045
## 
## [[2]]
## [1] 5.479003
## 
## [[3]]
## [1] 3.149606
## 
## [[4]]
## [1] 6.627297
## 
## [[5]]
## [1] 4.92126
## 
## [[6]]
## [1] 5.839895
## 
## [[7]]
## [1] 5.413386
## 
## [[8]]
## [1] 3.182415
## 
## [[9]]
## [1] 6.003937
## 
## [[10]]
## [1] 5.971129
height_ft[["error"]][1:10]
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
# Take the log of each element in the list
a <- list(-10, 1, 10, 0) %>% 
    map(possibly(function(x){ log(x) }, otherwise=NA_real_))
## Warning in log(x): NaNs produced
# Create a piped workflow that returns double vectors
height_cm %>%  
    map_dbl(possibly(function(x){ x * 0.0328084 }, otherwise=NA_real_)) 
##  [1] 5.643045 5.479003 3.149606 6.627297 4.921260 5.839895 5.413386
##  [8] 3.182415 6.003937 5.971129 6.167979 5.905512 7.480315 5.905512
## [15] 5.675853 5.741470 5.577428 5.905512 2.165354 5.577428 6.003937
## [22] 6.561680 6.233596 5.807087 5.741470 5.905512 4.921260       NA
## [29] 2.887139 5.249344 6.332021 6.266404 5.577428 6.430446 7.349082
## [36] 6.758530 6.003937 4.494751 3.674541 6.003937 5.347769 5.741470
## [43] 5.905512 5.839895 3.083990 4.002625 5.347769 6.167979 6.496063
## [50] 6.430446 5.610236 6.036746 6.167979 8.661418 6.167979 6.430446
## [57] 6.069554 5.150919 6.003937 6.003937 5.577428 5.446194 5.413386
## [64] 6.332021 6.266404 6.003937 5.511811 6.496063 7.513124 6.988189
## [71] 5.479003 2.591864 3.149606 6.332021 6.266404 5.839895 7.086614
## [78] 7.677166 6.167979 5.839895 6.758530       NA       NA       NA
## [85]       NA       NA 5.413386
films <- map_chr(sw_films, "url")
people <- map(sw_films, "characters")

people_by_film <- tibble::tibble(films = rep(films, times=map_int(people, length)), 
                                 film_url = unlist(people)
                                 )

# Print with walk
walk(people_by_film, print)
##   [1] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##   [3] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##   [5] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##   [7] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##   [9] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##  [11] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##  [13] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##  [15] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##  [17] "http://swapi.co/api/films/1/" "http://swapi.co/api/films/1/"
##  [19] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [21] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [23] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [25] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [27] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [29] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [31] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [33] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [35] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [37] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [39] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [41] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [43] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [45] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [47] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [49] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [51] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [53] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [55] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [57] "http://swapi.co/api/films/5/" "http://swapi.co/api/films/5/"
##  [59] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [61] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [63] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [65] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [67] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [69] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [71] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [73] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [75] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [77] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [79] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [81] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [83] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [85] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [87] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [89] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [91] "http://swapi.co/api/films/4/" "http://swapi.co/api/films/4/"
##  [93] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
##  [95] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
##  [97] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
##  [99] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [101] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [103] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [105] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [107] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [109] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [111] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [113] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [115] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [117] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [119] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [121] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [123] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [125] "http://swapi.co/api/films/6/" "http://swapi.co/api/films/6/"
## [127] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [129] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [131] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [133] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [135] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [137] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [139] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [141] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [143] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [145] "http://swapi.co/api/films/3/" "http://swapi.co/api/films/3/"
## [147] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [149] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [151] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [153] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [155] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [157] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [159] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [161] "http://swapi.co/api/films/2/" "http://swapi.co/api/films/2/"
## [163] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [165] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [167] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [169] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [171] "http://swapi.co/api/films/7/" "http://swapi.co/api/films/7/"
## [173] "http://swapi.co/api/films/7/"
##   [1] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/2/" 
##   [3] "http://swapi.co/api/people/3/"  "http://swapi.co/api/people/4/" 
##   [5] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/6/" 
##   [7] "http://swapi.co/api/people/7/"  "http://swapi.co/api/people/8/" 
##   [9] "http://swapi.co/api/people/9/"  "http://swapi.co/api/people/10/"
##  [11] "http://swapi.co/api/people/12/" "http://swapi.co/api/people/13/"
##  [13] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/15/"
##  [15] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
##  [17] "http://swapi.co/api/people/19/" "http://swapi.co/api/people/81/"
##  [19] "http://swapi.co/api/people/2/"  "http://swapi.co/api/people/3/" 
##  [21] "http://swapi.co/api/people/6/"  "http://swapi.co/api/people/7/" 
##  [23] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/11/"
##  [25] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/21/"
##  [27] "http://swapi.co/api/people/22/" "http://swapi.co/api/people/33/"
##  [29] "http://swapi.co/api/people/36/" "http://swapi.co/api/people/40/"
##  [31] "http://swapi.co/api/people/43/" "http://swapi.co/api/people/46/"
##  [33] "http://swapi.co/api/people/51/" "http://swapi.co/api/people/52/"
##  [35] "http://swapi.co/api/people/53/" "http://swapi.co/api/people/58/"
##  [37] "http://swapi.co/api/people/59/" "http://swapi.co/api/people/60/"
##  [39] "http://swapi.co/api/people/61/" "http://swapi.co/api/people/62/"
##  [41] "http://swapi.co/api/people/63/" "http://swapi.co/api/people/64/"
##  [43] "http://swapi.co/api/people/65/" "http://swapi.co/api/people/66/"
##  [45] "http://swapi.co/api/people/67/" "http://swapi.co/api/people/68/"
##  [47] "http://swapi.co/api/people/69/" "http://swapi.co/api/people/70/"
##  [49] "http://swapi.co/api/people/71/" "http://swapi.co/api/people/72/"
##  [51] "http://swapi.co/api/people/73/" "http://swapi.co/api/people/74/"
##  [53] "http://swapi.co/api/people/75/" "http://swapi.co/api/people/76/"
##  [55] "http://swapi.co/api/people/77/" "http://swapi.co/api/people/78/"
##  [57] "http://swapi.co/api/people/82/" "http://swapi.co/api/people/35/"
##  [59] "http://swapi.co/api/people/2/"  "http://swapi.co/api/people/3/" 
##  [61] "http://swapi.co/api/people/10/" "http://swapi.co/api/people/11/"
##  [63] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/20/"
##  [65] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/32/"
##  [67] "http://swapi.co/api/people/33/" "http://swapi.co/api/people/34/"
##  [69] "http://swapi.co/api/people/36/" "http://swapi.co/api/people/37/"
##  [71] "http://swapi.co/api/people/38/" "http://swapi.co/api/people/39/"
##  [73] "http://swapi.co/api/people/40/" "http://swapi.co/api/people/41/"
##  [75] "http://swapi.co/api/people/42/" "http://swapi.co/api/people/43/"
##  [77] "http://swapi.co/api/people/44/" "http://swapi.co/api/people/46/"
##  [79] "http://swapi.co/api/people/48/" "http://swapi.co/api/people/49/"
##  [81] "http://swapi.co/api/people/50/" "http://swapi.co/api/people/51/"
##  [83] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/53/"
##  [85] "http://swapi.co/api/people/54/" "http://swapi.co/api/people/55/"
##  [87] "http://swapi.co/api/people/56/" "http://swapi.co/api/people/57/"
##  [89] "http://swapi.co/api/people/58/" "http://swapi.co/api/people/59/"
##  [91] "http://swapi.co/api/people/47/" "http://swapi.co/api/people/35/"
##  [93] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/2/" 
##  [95] "http://swapi.co/api/people/3/"  "http://swapi.co/api/people/4/" 
##  [97] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/6/" 
##  [99] "http://swapi.co/api/people/7/"  "http://swapi.co/api/people/10/"
## [101] "http://swapi.co/api/people/11/" "http://swapi.co/api/people/12/"
## [103] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/20/"
## [105] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/33/"
## [107] "http://swapi.co/api/people/46/" "http://swapi.co/api/people/51/"
## [109] "http://swapi.co/api/people/52/" "http://swapi.co/api/people/53/"
## [111] "http://swapi.co/api/people/54/" "http://swapi.co/api/people/55/"
## [113] "http://swapi.co/api/people/56/" "http://swapi.co/api/people/58/"
## [115] "http://swapi.co/api/people/63/" "http://swapi.co/api/people/64/"
## [117] "http://swapi.co/api/people/67/" "http://swapi.co/api/people/68/"
## [119] "http://swapi.co/api/people/75/" "http://swapi.co/api/people/78/"
## [121] "http://swapi.co/api/people/79/" "http://swapi.co/api/people/80/"
## [123] "http://swapi.co/api/people/81/" "http://swapi.co/api/people/82/"
## [125] "http://swapi.co/api/people/83/" "http://swapi.co/api/people/35/"
## [127] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/2/" 
## [129] "http://swapi.co/api/people/3/"  "http://swapi.co/api/people/4/" 
## [131] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/10/"
## [133] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/14/"
## [135] "http://swapi.co/api/people/16/" "http://swapi.co/api/people/18/"
## [137] "http://swapi.co/api/people/20/" "http://swapi.co/api/people/21/"
## [139] "http://swapi.co/api/people/22/" "http://swapi.co/api/people/25/"
## [141] "http://swapi.co/api/people/27/" "http://swapi.co/api/people/28/"
## [143] "http://swapi.co/api/people/29/" "http://swapi.co/api/people/30/"
## [145] "http://swapi.co/api/people/31/" "http://swapi.co/api/people/45/"
## [147] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/2/" 
## [149] "http://swapi.co/api/people/3/"  "http://swapi.co/api/people/4/" 
## [151] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/10/"
## [153] "http://swapi.co/api/people/13/" "http://swapi.co/api/people/14/"
## [155] "http://swapi.co/api/people/18/" "http://swapi.co/api/people/20/"
## [157] "http://swapi.co/api/people/21/" "http://swapi.co/api/people/22/"
## [159] "http://swapi.co/api/people/23/" "http://swapi.co/api/people/24/"
## [161] "http://swapi.co/api/people/25/" "http://swapi.co/api/people/26/"
## [163] "http://swapi.co/api/people/1/"  "http://swapi.co/api/people/3/" 
## [165] "http://swapi.co/api/people/5/"  "http://swapi.co/api/people/13/"
## [167] "http://swapi.co/api/people/14/" "http://swapi.co/api/people/27/"
## [169] "http://swapi.co/api/people/84/" "http://swapi.co/api/people/85/"
## [171] "http://swapi.co/api/people/86/" "http://swapi.co/api/people/87/"
## [173] "http://swapi.co/api/people/88/"
data(gapminder, package="gapminder")
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gap_split <- split(gapminder, gapminder$country)
gap_split[[1]]
## # A tibble: 12 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
# Map over the first 10 elements of gap_split
plots <- map2(gap_split[1:10], names(gap_split[1:10]), 
              ~ ggplot(.x, aes(year, lifeExp)) + geom_line() + labs(title = .y)
              )

# Object name, then function name
walk(plots, print)


Chapter 4 - Problem Solving with purrr

Using purrr in your workflow:

  • Can set the names for a list using the purrr approach
    • sw_films <- sw_films %>% set_names(map_chr(sw_films, "title"))
    • map_chr(sw_films, ~.x[["episode_id"]]) %>% set_names(map_chr(sw_films, "title")) %>% sort()

Even more complex problems:

  • Values may be buried in lists nested inside other lists
    • forks <- gh_repos %>% map(~map(.x, "forks"))
    • bird_measurements %>% map_df(~ data_frame(weight = .x[["weight"]], wing_length = .x[["wing length"]], taxa = "bird")) %>% select_if(is.numeric) %>% summary()
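The bird_measurements list from the course is not available here, so the sketch below fabricates a tiny stand-in with the same shape to show how map_df() pulls buried elements out of each sub-list into one data frame:

```r
library(purrr)

# Hypothetical stand-in for the course's bird_measurements list
bird_measurements <- list(
    list(weight = 10, `wing length` = 7, band = "A1"),
    list(weight = 12, `wing length` = 8, band = "A2")
)

# Extract the buried numeric elements; each sub-list becomes one row
bird_df <- bird_measurements %>%
    map_df(~ tibble::tibble(weight = .x[["weight"]],
                            wing_length = .x[["wing length"]]))

summary(bird_df)
```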

Graphs in purrr:

  • Can use ggplot2 to plot the elements using purrr

Wrap up:

  • Iteration, data stored in lists, easy to read/write code

Example code includes:

# Load the data
data(gh_users, package="repurrrsive")

# Check if data has names
names(gh_users)
## NULL
# Map over name element of list
map(gh_users, ~.x[["name"]])
## [[1]]
## [1] "Gábor Csárdi"
## 
## [[2]]
## [1] "Jennifer (Jenny) Bryan"
## 
## [[3]]
## [1] "Jeff L."
## 
## [[4]]
## [1] "Julia Silge"
## 
## [[5]]
## [1] "Thomas J. Leeper"
## 
## [[6]]
## [1] "Maëlle Salmon"
# Name gh_users with the names of the users
gh_users <- gh_users %>% 
    set_names(map_chr(gh_users, "name"))


# Check gh_repos structure
data(gh_repos, package="repurrrsive")
# str(gh_repos)  # List is much too long for str() printing

# Name gh_repos with the names of the repo owner 
gh_repos_named <- gh_repos %>% 
    map_chr(~map_chr(.x, ~.x$owner$login)[1]) %>% 
    set_names(gh_repos, .)


# Determine who joined github first
map_chr(gh_users, ~.x[["created_at"]]) %>%
    set_names(map_chr(gh_users, "name")) %>%
    sort()
## Jennifer (Jenny) Bryan           Gábor Csárdi                Jeff L. 
## "2011-02-03T22:37:41Z" "2011-03-09T17:29:25Z" "2012-03-24T18:16:43Z" 
##       Thomas J. Leeper          Maëlle Salmon            Julia Silge 
## "2013-02-07T21:07:00Z" "2014-08-05T08:10:04Z" "2015-05-19T02:51:23Z"
# Determine user versus organization
map_lgl(gh_users, ~.x[["type"]] == "User") %>%
    sum() == length(gh_users)
## [1] TRUE
# Determine who has the most public repositories
map_int(gh_users, ~.x[["public_repos"]]) %>%
    set_names(map_chr(gh_users, "name")) %>%
    sort()
##            Julia Silge          Maëlle Salmon           Gábor Csárdi 
##                     26                     31                     52 
##                Jeff L.       Thomas J. Leeper Jennifer (Jenny) Bryan 
##                     67                     99                    168
# Set names of gh_repos with name subelement
gh_repos <- gh_repos %>% 
    map_chr(~map_chr(.x, ~.x$owner$login)[1]) %>% 
    set_names(gh_repos, .)

# Check to make sure list has the right names
names(gh_repos)
## [1] "gaborcsardi" "jennybc"     "jtleek"      "juliasilge"  "leeper"     
## [6] "masalmon"
# Map over gh_repos to generate numeric output
map(gh_repos, 
    ~map_dbl(.x, ~.x[["size"]])) %>%
    # Grab the largest element
    map(~max(.x))
## $gaborcsardi
## [1] 39461
## 
## $jennybc
## [1] 96325
## 
## $jtleek
## [1] 374812
## 
## $juliasilge
## [1] 24070
## 
## $leeper
## [1] 558176
## 
## $masalmon
## [1] 76455
gh_users_df <- tibble::tibble(public_repos=map_int(gh_users, ~.x[["public_repos"]]), 
                              followers=map_int(gh_users, "followers")
                              )

# Scatter plot of public repos and followers
ggplot(data = gh_users_df, aes(x = public_repos, y = followers)) + 
    geom_point()

map(gh_repos_named, "followers")
## $gaborcsardi
## NULL
## 
## $jennybc
## NULL
## 
## $jtleek
## NULL
## 
## $juliasilge
## NULL
## 
## $leeper
## NULL
## 
## $masalmon
## NULL
# Histogram of followers        
gh_users_df %>%
    ggplot(aes(x = followers)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create a dataframe with four columns
map_df(gh_users, `[`, c("login", "name", "followers", "public_repos")) %>%
    # Plot followers by public_repos
    ggplot(., aes(x = followers, y = public_repos)) + 
    # Create scatter plots
    geom_point()

# Turn data into correct dataframe format
film_by_character <- tibble(filmtitle = map_chr(sw_films, "title")) %>%
    transmute(filmtitle, characters = map(sw_films, "characters")) %>%
    unnest()

# Pull out elements from sw_people
sw_characters <- map_df(sw_people, `[`, c("height", "mass", "name", "url"))

# Join the two new objects
inner_join(film_by_character, sw_characters, by = c("characters" = "url")) %>%
    # Make sure the columns are numbers
    mutate(height1 = ifelse(height=="unknown", NA, as.numeric(height)), 
           mass1 = ifelse(mass=="unknown", NA, as.numeric(stringr::str_replace(mass, ",", "")))
           ) %>% 
    filter(!is.na(height)) %>%
    ggplot(aes(x = height)) +
    geom_bar(stat="count") + 
    # geom_histogram(stat = "count") + 
    facet_wrap(~filmtitle)
## Warning in ifelse(height == "unknown", NA, as.numeric(height)): NAs
## introduced by coercion
## Warning in ifelse(mass == "unknown", NA,
## as.numeric(stringr::str_replace(mass, : NAs introduced by coercion


Joining Data in R with data.table

Chapter 1 - Joining Multiple data.tables

Introduction:

  • Combining data from multiple data sets can be valuable for analysis
  • Need to identify the table keys for a successful join
    • demographics <- data.table(name = c("Trey", "Matthew", "Angela"), gender = c(NA, "M", "F"), age = c(54, 43, 39))
    • shipping <- data.table(name = c("Matthew", "Trey", "Angela"), address = c("7 Mill road", "12 High street", "33 Pacific boulevard"))
    • tables() # shows all the data.tables in the session along with their rows, columns, size, and keys
    • str(demographics)
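A minimal sketch of how keys show up in practice, using the demographics table from the bullets above: setkey() records the key (and sorts the table by it), and tables() then reports it in the KEY column.

```r
library(data.table)

demographics <- data.table(name = c("Trey", "Matthew", "Angela"),
                           gender = c(NA, "M", "F"),
                           age = c(54, 43, 39))

# Setting a key sorts the table by that column and records it,
# which tables() reports in its KEY column
setkey(demographics, name)
haskey(demographics)  # TRUE
key(demographics)     # "name"
tables()
```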

Merge function:

  • Can run inner, full outer, left, and right joins using merge()
    • merge(x = demographics, y = shipping, by.x = "name", by.y = "name") # default is an inner join
    • merge(x = demographics, y = shipping, by = "name") # shorter form if the tables share the key name
    • merge(x = demographics, y = shipping, by = "name", all = TRUE) # full outer join

Left and right joins:

  • Can run left joins or right joins using all.x and all.y
    • merge(x = demographics, y = shipping, by = "name", all.x = TRUE)
    • merge(x = demographics, y = shipping, by = "name", all.y = TRUE)
    • merge(x = demographics, y = shipping, by = "name", all.y = TRUE) is the same as merge(x = shipping, y = demographics, by = "name", all.x = TRUE)
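The right-join/left-join equivalence in the last bullet can be checked directly with the demographics and shipping tables defined earlier in this chapter (only the column order of the result differs between the two calls, so the sketch compares the key column):

```r
library(data.table)

demographics <- data.table(name = c("Trey", "Matthew", "Angela"),
                           gender = c(NA, "M", "F"),
                           age = c(54, 43, 39))
shipping <- data.table(name = c("Matthew", "Trey", "Angela"),
                       address = c("7 Mill road", "12 High street",
                                   "33 Pacific boulevard"))

# A right join keeps every row of y (shipping)...
rj <- merge(demographics, shipping, by = "name", all.y = TRUE)

# ...which matches a left join with the tables swapped
lj <- merge(shipping, demographics, by = "name", all.x = TRUE)

identical(sort(rj$name), sort(lj$name))  # TRUE
```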

Example code includes:

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:xts':
## 
##     first, last
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
netflix <- fread("./RInputFiles/netflix_2017.csv", sep=",")
imdb <- fread("./RInputFiles/imdb_ratings.csv", sep=",")


# What data.tables are in my R session?
tables()
##       NAME NROW NCOL MB                        COLS KEY
## 1:    imdb   11    2  0                title,rating    
## 2: netflix    8    3  0 title,episodes,release_date    
## Total: 0MB
# View the first six rows 
head(netflix)
##                             title episodes release_date
## 1: A Series of Unfortunate Events        8    13-Jan-17
## 2:                 13 Reasons Why       13    31-Mar-17
## 3:                          Gypsy       10    30-Jun-17
## 4:                          Ozark       10    21-Jul-17
## 5:                     Mindhunter       10    13-Oct-17
## 6:                       Longmire       10    17-Nov-17
head(imdb)
##             title rating
## 1:    The Orville    7.7
## 2:      Big Mouth    8.3
## 3:     The Gifted    8.2
## 4:          Gypsy    7.0
## 5:       Inhumans    5.4
## 6: 13 Reasons Why    8.4
# Print the structure
str(netflix)
## Classes 'data.table' and 'data.frame':   8 obs. of  3 variables:
##  $ title       : chr  "A Series of Unfortunate Events" "13 Reasons Why" "Gypsy" "Ozark" ...
##  $ episodes    : int  8 13 10 10 10 10 6 10
##  $ release_date: chr  "13-Jan-17" "31-Mar-17" "30-Jun-17" "21-Jul-17" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(imdb)
## Classes 'data.table' and 'data.frame':   11 obs. of  2 variables:
##  $ title : chr  "The Orville" "Big Mouth" "The Gifted" "Gypsy" ...
##  $ rating: num  7.7 8.3 8.2 7 5.4 8.4 7.3 8.9 8.4 8.5 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Print the data.tables in your R session
netflix
##                             title episodes release_date
## 1: A Series of Unfortunate Events        8    13-Jan-17
## 2:                 13 Reasons Why       13    31-Mar-17
## 3:                          Gypsy       10    30-Jun-17
## 4:                          Ozark       10    21-Jul-17
## 5:                     Mindhunter       10    13-Oct-17
## 6:                       Longmire       10    17-Nov-17
## 7:                        Godless        6    22-Nov-17
## 8:                           Dark       10     1-Dec-17
imdb
##                                        title rating
##  1:                              The Orville    7.7
##  2:                                Big Mouth    8.3
##  3:                               The Gifted    8.2
##  4:                                    Gypsy    7.0
##  5:                                 Inhumans    5.4
##  6:                           13 Reasons Why    8.4
##  7:                     Star Trek: Discovery    7.3
##  8:                               Mindhunter    8.9
##  9: Mystery Science Theatre 3000: The Return    8.4
## 10:                                    Ozark    8.5
## 11:           A Series of Unfortunate Events    7.9
# Inner join netflix and imdb
merge(netflix, imdb, by = "title")
##                             title episodes release_date rating
## 1:                 13 Reasons Why       13    31-Mar-17    8.4
## 2: A Series of Unfortunate Events        8    13-Jan-17    7.9
## 3:                          Gypsy       10    30-Jun-17    7.0
## 4:                     Mindhunter       10    13-Oct-17    8.9
## 5:                          Ozark       10    21-Jul-17    8.5
# Full join netflix and imdb
merge(netflix, imdb, by = "title", all=TRUE)
##                                        title episodes release_date rating
##  1:                           13 Reasons Why       13    31-Mar-17    8.4
##  2:           A Series of Unfortunate Events        8    13-Jan-17    7.9
##  3:                                Big Mouth       NA         <NA>    8.3
##  4:                                     Dark       10     1-Dec-17     NA
##  5:                                  Godless        6    22-Nov-17     NA
##  6:                                    Gypsy       10    30-Jun-17    7.0
##  7:                                 Inhumans       NA         <NA>    5.4
##  8:                                 Longmire       10    17-Nov-17     NA
##  9:                               Mindhunter       10    13-Oct-17    8.9
## 10: Mystery Science Theatre 3000: The Return       NA         <NA>    8.4
## 11:                                    Ozark       10    21-Jul-17    8.5
## 12:                     Star Trek: Discovery       NA         <NA>    7.3
## 13:                               The Gifted       NA         <NA>    8.2
## 14:                              The Orville       NA         <NA>    7.7
# Full join imdb and netflix
merge(imdb, netflix, by = "title", all = TRUE)
##                                        title rating episodes release_date
##  1:                           13 Reasons Why    8.4       13    31-Mar-17
##  2:           A Series of Unfortunate Events    7.9        8    13-Jan-17
##  3:                                Big Mouth    8.3       NA         <NA>
##  4:                                     Dark     NA       10     1-Dec-17
##  5:                                  Godless     NA        6    22-Nov-17
##  6:                                    Gypsy    7.0       10    30-Jun-17
##  7:                                 Inhumans    5.4       NA         <NA>
##  8:                                 Longmire     NA       10    17-Nov-17
##  9:                               Mindhunter    8.9       10    13-Oct-17
## 10: Mystery Science Theatre 3000: The Return    8.4       NA         <NA>
## 11:                                    Ozark    8.5       10    21-Jul-17
## 12:                     Star Trek: Discovery    7.3       NA         <NA>
## 13:                               The Gifted    8.2       NA         <NA>
## 14:                              The Orville    7.7       NA         <NA>
# Left join imdb to netflix
merge(netflix, imdb, by="title", all.x=TRUE)
##                             title episodes release_date rating
## 1:                 13 Reasons Why       13    31-Mar-17    8.4
## 2: A Series of Unfortunate Events        8    13-Jan-17    7.9
## 3:                           Dark       10     1-Dec-17     NA
## 4:                        Godless        6    22-Nov-17     NA
## 5:                          Gypsy       10    30-Jun-17    7.0
## 6:                       Longmire       10    17-Nov-17     NA
## 7:                     Mindhunter       10    13-Oct-17    8.9
## 8:                          Ozark       10    21-Jul-17    8.5
# Right join imdb to netflix
merge(netflix, imdb, by="title", all.y=TRUE)
##                                        title episodes release_date rating
##  1:                           13 Reasons Why       13    31-Mar-17    8.4
##  2:           A Series of Unfortunate Events        8    13-Jan-17    7.9
##  3:                                Big Mouth       NA         <NA>    8.3
##  4:                                    Gypsy       10    30-Jun-17    7.0
##  5:                                 Inhumans       NA         <NA>    5.4
##  6:                               Mindhunter       10    13-Oct-17    8.9
##  7: Mystery Science Theatre 3000: The Return       NA         <NA>    8.4
##  8:                                    Ozark       10    21-Jul-17    8.5
##  9:                     Star Trek: Discovery       NA         <NA>    7.3
## 10:                               The Gifted       NA         <NA>    8.2
## 11:                              The Orville       NA         <NA>    7.7
# Compare to a left join of netflix to imdb
merge(imdb, netflix, by="title", all.x=TRUE)
##                                        title rating episodes release_date
##  1:                           13 Reasons Why    8.4       13    31-Mar-17
##  2:           A Series of Unfortunate Events    7.9        8    13-Jan-17
##  3:                                Big Mouth    8.3       NA         <NA>
##  4:                                    Gypsy    7.0       10    30-Jun-17
##  5:                                 Inhumans    5.4       NA         <NA>
##  6:                               Mindhunter    8.9       10    13-Oct-17
##  7: Mystery Science Theatre 3000: The Return    8.4       NA         <NA>
##  8:                                    Ozark    8.5       10    21-Jul-17
##  9:                     Star Trek: Discovery    7.3       NA         <NA>
## 10:                               The Gifted    8.2       NA         <NA>
## 11:                              The Orville    7.7       NA         <NA>
australia_area <- fread("./RInputFiles/australia_area.csv", sep=",")
australia_capitals <- fread("./RInputFiles/australia_capitals.csv", sep=",")
australia_cities_top20 <- fread("./RInputFiles/australia_cities_top20.csv", sep=",")


# Identify the key for joining capitals and population
capitals_population_key <- "city"

# Left join population to capitals
capital_pop <- merge(australia_capitals, 
                     australia_cities_top20[, c("city", "population")], 
                     by=capitals_population_key, all.x=TRUE
                     )
capital_pop
##                     city                        state   country population
## 1:              Adelaide              South Australia Australia    1324279
## 2:              Brisbane                   Queensland Australia    2360241
## 3:   Canberra–Queanbeyan Australian Capital Territory Australia         NA
## 4:                Darwin           Northern Territory Australia     145916
## 5:                Hobart                     Tasmania Australia     224462
## 6:             Melbourne                     Victoria Australia    4725316
## 7:                 Perth            Western Australia Australia    2022044
## 8:                Sydney              New South Wales Australia    5029768
# Identify the key for joining capital_pop and area
capital_pop_area_key <- "state"

# Inner join area to capital pop
australia_stats <- merge(capital_pop, australia_area[, c("state", "area_km2")], by=capital_pop_area_key)

# Print the final result
australia_stats
##                           state                  city   country population
## 1: Australian Capital Territory  Canberra–Queanbeyan Australia         NA
## 2:              New South Wales                Sydney Australia    5029768
## 3:           Northern Territory                Darwin Australia     145916
## 4:                   Queensland              Brisbane Australia    2360241
## 5:              South Australia              Adelaide Australia    1324279
## 6:                     Tasmania                Hobart Australia     224462
## 7:                     Victoria             Melbourne Australia    4725316
## 8:            Western Australia                 Perth Australia    2022044
##    area_km2
## 1:     2358
## 2:   800641
## 3:  1349129
## 4:  1730647
## 5:   983482
## 6:    68401
## 7:   227416
## 8:  2529875

Chapter 2 - Joins Using data.table Syntax

Joins using data.table syntax:

  • The general form of data.table syntax is
    • DT[i, j, by] # action j is taken on rows i, grouped by "by"
    • DT[i, on] # this merges data.table "i" onto data.table "DT" by the key(s) in "on"
    • demographics[shipping, on = .(name)]
  • Variables inside list() or .() are looked up in the column names of both data.tables
    • shipping[demographics, on = list(name)] # all records in demographics are kept, plus matching records from shipping
    • shipping[demographics, on = .(name)]
    • join_key <- c("name")
    • shipping[demographics, on = join_key]
  • For an inner join, supply nomatch = 0L; full joins are not possible with the bracket syntax, so use merge(..., all = TRUE) instead
    • shipping[demographics, on = .(name), nomatch = 0L]
  • An anti-join uses the negation operator
    • demographics[!shipping, on = .(name)] # all records in demographics that are NOT in shipping
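The join forms above can be sketched with toy data; the table names echo the demographics/shipping example in the notes, and the values are hypothetical:

```r
library(data.table)

# Toy tables (hypothetical values)
demographics <- data.table(name = c("Ana", "Ben", "Cal"), age = c(34, 28, 45))
shipping <- data.table(name = c("Ana", "Ana", "Ben", "Dee"),
                       address = c("A1", "A2", "B1", "D1"))

# Right join: every row of demographics (the i= table) is kept
shipping[demographics, on = .(name)]

# Inner join: rows without a match are dropped
shipping[demographics, on = .(name), nomatch = 0L]

# Anti-join: rows of demographics with no match in shipping
demographics[!shipping, on = .(name)]
```

Note that Ana matches twice, so the first join returns four rows (two for Ana, one for Ben, and one NA-filled row for Cal), while the anti-join returns only Cal.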

Setting and viewing data.table keys:

  • With keys set, the on= argument is no longer needed for joins on that key
    • Setting a key also sorts the data.table by that key in memory (which makes merge operations faster)
    • setkey(DT, …)
    • setkey(DT, key1, key2, key3)
    • setkey(DT, "key1", "key2", "key3")
  • The setkeyv() function accepts a character vector of key column names
    • keys <- c("key1", "key2", "key3")
    • setkeyv(dt, keys)
  • Can check whether a key exists and find its name(s)
    • haskey(dt1)
    • key(dt1)
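A minimal sketch of key setting and inspection, using a hypothetical table:

```r
library(data.table)

dt1 <- data.table(id = c(3, 1, 2), x = c("c", "a", "b"))

haskey(dt1)      # FALSE before any key is set
setkey(dt1, id)  # sets the key and sorts dt1 by id in place
haskey(dt1)      # TRUE
key(dt1)         # "id"

# Equivalent, using a character vector of column names
keys <- c("id")
setkeyv(dt1, keys)
```

After setkey(), dt1 is physically reordered by id, so dt1$id is 1, 2, 3.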

Incorporating joins in the data.table workflow:

  • Joins can be included in data.table workflows, enabling rapid analysis
    • DT1[DT2, on][i, j, by] # join, followed by standard data.table operations
    • customers[purchases, on = .(name)][sales > 1, j = .(avg_spent = sum(spent) / sum(sales)), by = .(gender)]
  • Can also incorporate calculations and new column creation within the join itself
    • DT1[DT2, on, j]
    • customers[purchases, on = .(name), return_customer := sales > 1]
    • DT1[DT2, on, j, by = .EACHI] # groups j by each row of DT2 (the i= table)
    • shipping[customers, on = .(name), j = .("# of shipping addresses" = .N), by = .EACHI]
    • customers[shipping, on = .(name), .(avg_age = mean(age)), by = .(gender)]
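The chained join-then-aggregate pattern can be sketched with toy customers/purchases tables (hypothetical values):

```r
library(data.table)

customers <- data.table(name = c("Ana", "Ben", "Cal"), gender = c("F", "M", "M"))
purchases <- data.table(name = c("Ana", "Ben", "Cal"),
                        sales = c(3, 1, 2), spent = c(30, 10, 24))

# Join, then filter and aggregate in one chained call
customers[purchases, on = .(name)][
  sales > 1, .(avg_spent = sum(spent) / sum(sales)), by = .(gender)]

# Create a new column during the join with :=
customers[purchases, on = .(name), return_customer := sales > 1]
```

The chained call keeps customers with more than one sale (Ana and Cal) and reports average spend per sale by gender; the := call adds return_customer to customers by reference.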

Example code includes:

# Right join population to capitals using data.table syntax
australia_capitals[australia_cities_top20, on = "city"]
##                       city              state   country population
##  1:                 Sydney    New South Wales Australia    5029768
##  2:              Melbourne           Victoria Australia    4725316
##  3:               Brisbane         Queensland Australia    2360241
##  4:                  Perth  Western Australia Australia    2022044
##  5:               Adelaide    South Australia Australia    1324279
##  6: Gold Coast-Tweed Heads               <NA>      <NA>     646983
##  7:     Newcastle-Maitland               <NA>      <NA>     436171
##  8:    Canberra-Queanbeyan               <NA>      <NA>     435019
##  9:         Sunshine Coast               <NA>      <NA>     317404
## 10:             Wollongong               <NA>      <NA>     295669
## 11:                 Hobart           Tasmania Australia     224462
## 12:                Geelong               <NA>      <NA>     192393
## 13:             Townsville               <NA>      <NA>     178864
## 14:                 Cairns               <NA>      <NA>     150041
## 15:                 Darwin Northern Territory Australia     145916
## 16:              Toowoomba               <NA>      <NA>     114024
## 17:               Ballarat               <NA>      <NA>     101588
## 18:                Bendigo               <NA>      <NA>      95587
## 19:         Albury-Wodonga               <NA>      <NA>      90576
## 20:             Launceston               <NA>      <NA>      86335
##     percentage
##  1:     0.2074
##  2:     0.1924
##  3:     0.0974
##  4:     0.0856
##  5:     0.0550
##  6:     0.0264
##  7:     0.0182
##  8:     0.0178
##  9:     0.0127
## 10:     0.0123
## 11:     0.0092
## 12:     0.0079
## 13:     0.0076
## 14:     0.0062
## 15:     0.0060
## 16:     0.0048
## 17:     0.0042
## 18:     0.0039
## 19:     0.0037
## 20:     0.0036
# Right join using merge
merge(australia_capitals, australia_cities_top20, by = "city", all.y = TRUE)
##                       city              state   country population
##  1:               Adelaide    South Australia Australia    1324279
##  2:         Albury-Wodonga               <NA>      <NA>      90576
##  3:               Ballarat               <NA>      <NA>     101588
##  4:                Bendigo               <NA>      <NA>      95587
##  5:               Brisbane         Queensland Australia    2360241
##  6:                 Cairns               <NA>      <NA>     150041
##  7:    Canberra-Queanbeyan               <NA>      <NA>     435019
##  8:                 Darwin Northern Territory Australia     145916
##  9:                Geelong               <NA>      <NA>     192393
## 10: Gold Coast-Tweed Heads               <NA>      <NA>     646983
## 11:                 Hobart           Tasmania Australia     224462
## 12:             Launceston               <NA>      <NA>      86335
## 13:              Melbourne           Victoria Australia    4725316
## 14:     Newcastle-Maitland               <NA>      <NA>     436171
## 15:                  Perth  Western Australia Australia    2022044
## 16:         Sunshine Coast               <NA>      <NA>     317404
## 17:                 Sydney    New South Wales Australia    5029768
## 18:              Toowoomba               <NA>      <NA>     114024
## 19:             Townsville               <NA>      <NA>     178864
## 20:             Wollongong               <NA>      <NA>     295669
##     percentage
##  1:     0.0550
##  2:     0.0037
##  3:     0.0042
##  4:     0.0039
##  5:     0.0974
##  6:     0.0062
##  7:     0.0178
##  8:     0.0060
##  9:     0.0079
## 10:     0.0264
## 11:     0.0092
## 12:     0.0036
## 13:     0.1924
## 14:     0.0182
## 15:     0.0856
## 16:     0.0127
## 17:     0.2074
## 18:     0.0048
## 19:     0.0076
## 20:     0.0123
# Inner join with the data.table syntax
australia_capitals[australia_cities_top20, on="city", nomatch=0L]
##         city              state   country population percentage
## 1:    Sydney    New South Wales Australia    5029768     0.2074
## 2: Melbourne           Victoria Australia    4725316     0.1924
## 3:  Brisbane         Queensland Australia    2360241     0.0974
## 4:     Perth  Western Australia Australia    2022044     0.0856
## 5:  Adelaide    South Australia Australia    1324279     0.0550
## 6:    Hobart           Tasmania Australia     224462     0.0092
## 7:    Darwin Northern Territory Australia     145916     0.0060
# Anti-join capitals to population
australia_cities_top20[!australia_capitals, on="city"]
##                       city population percentage
##  1: Gold Coast-Tweed Heads     646983     0.0264
##  2:     Newcastle-Maitland     436171     0.0182
##  3:    Canberra-Queanbeyan     435019     0.0178
##  4:         Sunshine Coast     317404     0.0127
##  5:             Wollongong     295669     0.0123
##  6:                Geelong     192393     0.0079
##  7:             Townsville     178864     0.0076
##  8:                 Cairns     150041     0.0062
##  9:              Toowoomba     114024     0.0048
## 10:               Ballarat     101588     0.0042
## 11:                Bendigo      95587     0.0039
## 12:         Albury-Wodonga      90576     0.0037
## 13:             Launceston      86335     0.0036
# Anti-join capitals to area
australia_area[!australia_capitals, on="state"]
##                                state area_km2 percentage
## 1:    Australian Antarctic Territory  5896500         NA
## 2: Heard Island and McDonald Islands      372          0
## 3:       Ashmore and Cartier Islands      199          0
## 4:                  Christmas Island      135          0
## 5:              Jervis Bay Territory       73          0
## 6:                    Norfolk Island       35          0
## 7:           Cocos (Keeling) Islands       14          0
## 8:                 Coral Sea Islands       10          0
# Set the keys
setkey(netflix, "title")
setkey(imdb, "title")

# Inner join
netflix[imdb, nomatch=0L]
##                             title episodes release_date rating
## 1:                 13 Reasons Why       13    31-Mar-17    8.4
## 2: A Series of Unfortunate Events        8    13-Jan-17    7.9
## 3:                          Gypsy       10    30-Jun-17    7.0
## 4:                     Mindhunter       10    13-Oct-17    8.9
## 5:                          Ozark       10    21-Jul-17    8.5
# Check for keys
haskey(netflix)
## [1] TRUE
haskey(imdb)
## [1] TRUE
# Find the key
the_key <- "title"

# Set the key for the other data.table
setkeyv(imdb, the_key)


# Inner join capitals to population
australia_cities_top20[australia_capitals, on="city", nomatch=0L]
##         city population percentage              state   country
## 1:    Sydney    5029768     0.2074    New South Wales Australia
## 2: Melbourne    4725316     0.1924           Victoria Australia
## 3:  Brisbane    2360241     0.0974         Queensland Australia
## 4:     Perth    2022044     0.0856  Western Australia Australia
## 5:  Adelaide    1324279     0.0550    South Australia Australia
## 6:    Hobart     224462     0.0092           Tasmania Australia
## 7:    Darwin     145916     0.0060 Northern Territory Australia
# Join and sum
australia_cities_top20[australia_capitals, on = .(city), nomatch = 0, j = sum(percentage)]
## [1] 0.653
continents <- fread("./RInputFiles/continents.csv", sep=",")
life_exp <- fread("./RInputFiles/gapminder_life_expectancy_2010.csv", sep=",")
life_exp <- life_exp %>% rename(years = life_expectancy)
str(continents)
## Classes 'data.table' and 'data.frame':   235 obs. of  2 variables:
##  $ continent: chr  "africa" "africa" "africa" "africa" ...
##  $ country  : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(life_exp)
## Classes 'data.table' and 'data.frame':   208 obs. of  2 variables:
##  $ country: chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ years  : num  53.6 77.2 76 72.8 84.7 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# What countries are listed in multiple continents?
continents[life_exp, on = .(country), .N, by = .EACHI][N > 1]
##       country N
## 1:    Armenia 2
## 2: Azerbaijan 2
## 3:     Cyprus 2
## 4:    Georgia 2
## 5: Kazakhstan 2
## 6:     Russia 2
## 7:     Turkey 2
# Calculate average life expectancy per continent:
avg_life_expectancy <- continents[life_exp, on = .(country), nomatch=0L][, j = mean(years), by = continent]
avg_life_expectancy
##        continent       V1
## 1:          asia 73.27039
## 2:        europe 77.43404
## 3:        africa 61.00556
## 4: north_america 73.73964
## 5: south_america 74.41000
## 6:       oceania 69.56077

Chapter 3 - Diagnosing and Fixing Common Join Problems

Complex keys:

  • A misspecified join is one where an incorrect join key has been used
  • A malformed join is one where the keys have no values in common, so the tables effectively stack rather than match
  • Best practice is to study the data in each column prior to attempting a join
  • When the keys have different names, can manage the merge with code
    • merge(customers, web_visits, by.x = "name", by.y = "person")
    • customers[web_visits, on = .(name = person)]
    • customers[web_visits, on = c("name" = "person")]
  • Can also merge with multiple keys
    • merge(purchases, web_visits, by = c("name", "date"))
    • merge(purchases, web_visits, by.x = c("name", "date"), by.y = c("person", "date")) # matches in the order that they appear
    • purchases[web_visits, on = c("name" = "person", "date")]
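Joining on differently named keys can be sketched as follows (the customers/web_visits names come from the notes above; the values are hypothetical):

```r
library(data.table)

customers <- data.table(name = c("Ana", "Ben"), age = c(34, 28))
web_visits <- data.table(person = c("Ana", "Ana", "Cal"),
                         date = c("2024-01-01", "2024-01-02", "2024-01-03"))

# merge() with a different key name on each side (inner join by default)
merge(customers, web_visits, by.x = "name", by.y = "person")

# Equivalent data.table-syntax joins (right joins keeping all of web_visits)
customers[web_visits, on = .(name = person)]
customers[web_visits, on = c("name" = "person")]
```

The merge() call returns only Ana's two visits, while the bracket-syntax joins also keep Cal's visit with an NA age.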

Tricky columns:

  • Tables sometimes share column names that are not intended as join keys
  • With the data.table syntax, duplicate names from the i= table are prefixed with i.
    • By contrast, with merge() duplicate names get .x and .y suffixes
  • It can be helpful to rename columns prior to the join using setnames()
    • setnames(parents, c("parent", "parent.gender", "parent.age"))
  • May want to join a data.frame to a data.table when the data.frame's rownames are the join key
    • parents <- as.data.table(parents, keep.rownames = "parent")
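A short sketch of the rownames-to-column and renaming steps, using hypothetical parent data:

```r
library(data.table)

# A data.frame whose key lives in its rownames (hypothetical values)
parents_df <- data.frame(gender = c("F", "M"), age = c(38, 44),
                         row.names = c("Kiana", "Aciano"))

# keep.rownames= moves the rownames into a regular first column named "parent"
parents <- as.data.table(parents_df, keep.rownames = "parent")

# Rename all columns up front to avoid .x/.y or i. clutter after a join
setnames(parents, c("parent", "parent.gender", "parent.age"))
```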

Duplicate matches:

  • May want to join tables that have many-to-many matches on the keys
    • This throws a long error message, and requires setting allow.cartesian = TRUE to show the result is intentional
    • Note that NA will match all other NA; this behavior can be overridden by filtering out the NA rows first
    • site2_ecology <- site2_ecology[!is.na(genus)]
  • Can keep only the first match or only the last match
    • site1_ecology[site2_ecology, on = .(genus), mult = "first"]
    • children[parents, on = .(parent = name), mult = "last"]
  • Can find the duplicated values
    • duplicated(site1_ecology) # boolean
    • duplicated(site1_ecology, by = "genus") # boolean, looking just at column genus
    • unique(site1_ecology, by = "genus") # removes the duplicates
    • duplicated(site1_ecology, by = "genus", fromLast = TRUE)
    • unique(site1_ecology, by = "genus", fromLast = TRUE)
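The mult= and duplicate-handling tools can be sketched with toy ecology tables (hypothetical genera and counts):

```r
library(data.table)

site1 <- data.table(genus = c("Acacia", "Acacia", "Banksia"),
                    site1_count = c(5, 8, 3))
site2 <- data.table(genus = c("Acacia", "Eucalyptus"),
                    site2_count = c(2, 7))

# Keep only the first (or last) match in site1 for each row of site2
site1[site2, on = .(genus), mult = "first"]
site1[site2, on = .(genus), mult = "last"]

# Flag and drop duplicates by the key column
duplicated(site1, by = "genus")               # FALSE TRUE FALSE
unique(site1, by = "genus")                   # keeps the first Acacia row
unique(site1, by = "genus", fromLast = TRUE)  # keeps the last Acacia row
```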

Example code includes:

guardians <- fread("./RInputFiles/school_db_guardians.tsv")
locations <- fread("./RInputFiles/school_db_locations.tsv")
students <- fread("./RInputFiles/school_db_students.tsv")
subjects <- fread("./RInputFiles/school_db_subjects.tsv")
teachers <- fread("./RInputFiles/school_db_teachers.tsv")


# Full join
merge(students, guardians, by="name", all=TRUE)
##        name sex.x age.x guardian sex.y age.y          phone
##  1:  Aciano  <NA>    NA     <NA>     M    44 0163-680-95557
##  2:   Adara     F    16    Kiana  <NA>    NA           <NA>
##  3:   Caleb     M    15    Tyler  <NA>    NA           <NA>
##  4:  Cierra     F    17    Kiana  <NA>    NA           <NA>
##  5:    Elsa     F    17     John  <NA>    NA           <NA>
##  6:    John  <NA>    NA     <NA>     M    34 0163-745-07369
##  7:  Kalvin     M    17     John  <NA>    NA           <NA>
##  8:   Kiana  <NA>    NA     <NA>     F    38 0163-875-41705
##  9: Makaela     F    17   Nicole  <NA>    NA           <NA>
## 10:  Nicole  <NA>    NA     <NA>     F    26 0163-266-89055
## 11:   Tyler  <NA>    NA     <NA>     M    48 0165-526-80087
## 12:  Yelena     F    17   Aciano  <NA>    NA           <NA>
students[guardians, on="name"]
##      name  sex age guardian i.sex i.age          phone
## 1:   John <NA>  NA     <NA>     M    34 0163-745-07369
## 2:  Kiana <NA>  NA     <NA>     F    38 0163-875-41705
## 3:  Tyler <NA>  NA     <NA>     M    48 0165-526-80087
## 4: Nicole <NA>  NA     <NA>     F    26 0163-266-89055
## 5: Aciano <NA>  NA     <NA>     M    44 0163-680-95557
# Change the code to an inner join
students[guardians, on = .(name), nomatch=0L]
## Empty data.table (0 rows and 7 cols): name,sex,age,guardian,i.sex,i.age...
# What are the correct join key columns?
students[guardians, on = c("guardian"="name"), nomatch = 0L]
##       name sex age guardian i.sex i.age          phone
## 1:  Kalvin   M  17     John     M    34 0163-745-07369
## 2:    Elsa   F  17     John     M    34 0163-745-07369
## 3:   Adara   F  16    Kiana     F    38 0163-875-41705
## 4:  Cierra   F  17    Kiana     F    38 0163-875-41705
## 5:   Caleb   M  15    Tyler     M    48 0165-526-80087
## 6: Makaela   F  17   Nicole     F    26 0163-266-89055
## 7:  Yelena   F  17   Aciano     M    44 0163-680-95557
# Intentionally errors out due to type mismatch
# subjects[locations, on=c("class", "semester")]

# Structure 
str(subjects)
## Classes 'data.table' and 'data.frame':   28 obs. of  3 variables:
##  $ name    : chr  "Yelena" "Yelena" "Yelena" "Yelena" ...
##  $ semester: int  1 1 2 2 1 1 2 2 1 1 ...
##  $ class   : chr  "Mathematics" "Programming" "Language" "Art" ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(locations)
## Classes 'data.table' and 'data.frame':   16 obs. of  4 variables:
##  $ class   : chr  "English" "Mathematics" "Art" "Programming" ...
##  $ semester: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ building: chr  "Block B" "Block C" "Block B" "Block A" ...
##  $ room    : chr  "Room 103" "Room 104" "Room 102" "Room 102" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Does semester have the same class? 
same_class <- FALSE

# Fix the column class
locations[, semester := as.integer(semester)]

# Right join
subjects[locations, on=c("class", "semester")]
##        name semester       class building     room
##  1:   Adara        1     English  Block B Room 103
##  2:  Cierra        1     English  Block B Room 103
##  3:  Yelena        1 Mathematics  Block C Room 104
##  4:    Elsa        1 Mathematics  Block C Room 104
##  5:    <NA>        1         Art  Block B Room 102
##  6:  Yelena        1 Programming  Block A Room 102
##  7: Makaela        1 Programming  Block A Room 102
##  8:   Caleb        1 Programming  Block A Room 102
##  9:  Kalvin        1     History  Block B Room 101
## 10:   Adara        1   Geography  Block A Room 101
## 11:   Caleb        1   Geography  Block A Room 101
## 12:    Elsa        1    Politics  Block A Room 104
## 13: Makaela        1    Politics  Block A Room 104
## 14:  Kalvin        1    Language  Block C Room 103
## 15:  Cierra        1    Language  Block C Room 103
## 16:    <NA>        2     English  Block A Room 104
## 17:   Caleb        2 Mathematics  Block A Room 102
## 18:  Yelena        2         Art  Block B Room 102
## 19:  Cierra        2         Art  Block B Room 102
## 20: Makaela        2 Programming  Block C Room 104
## 21:    Elsa        2     History  Block A Room 103
## 22:   Adara        2   Geography  Block C Room 103
## 23:   Caleb        2   Geography  Block C Room 103
## 24:  Kalvin        2    Politics  Block C Room 102
## 25:    Elsa        2    Politics  Block C Room 102
## 26:   Adara        2    Politics  Block C Room 102
## 27:  Cierra        2    Politics  Block C Room 102
## 28:  Yelena        2    Language  Block B Room 103
## 29:  Kalvin        2    Language  Block B Room 103
## 30: Makaela        2    Language  Block B Room 103
##        name semester       class building     room
# Identify and set the keys
join_key <- c("subject"="class")

# Right join
teachers[locations, on=join_key]
##          teacher     subject semester building     room
##  1:  Mr. Marquez     English        1  Block B Room 103
##  2:   Ms. Schoon Mathematics        1  Block C Room 104
##  3:   Ms. Harris         Art        1  Block B Room 102
##  4:   Ms. Homann Programming        1  Block A Room 102
##  5:  Mr. Santoyo     History        1  Block B Room 101
##  6: Mr. Carbajal   Geography        1  Block A Room 101
##  7: Ms. Limitone    Politics        1  Block A Room 104
##  8:      Ms. Low    Language        1  Block C Room 103
##  9:  Mr. Marquez     English        2  Block A Room 104
## 10:   Ms. Schoon Mathematics        2  Block A Room 102
## 11:   Ms. Harris         Art        2  Block B Room 102
## 12:   Ms. Homann Programming        2  Block C Room 104
## 13:  Mr. Santoyo     History        2  Block A Room 103
## 14: Mr. Carbajal   Geography        2  Block C Room 103
## 15: Ms. Limitone    Politics        2  Block C Room 102
## 16:      Ms. Low    Language        2  Block B Room 103
# Inner join 1
capital_pop <- merge(australia_capitals, australia_cities_top20, by="city", nomatch=0L)

# Inner join 2
merge(capital_pop, australia_area, by="state", suffixes=c(".pop", ".area"), nomatch=0L)
##                 state      city   country population percentage.pop
## 1:    New South Wales    Sydney Australia    5029768         0.2074
## 2: Northern Territory    Darwin Australia     145916         0.0060
## 3:         Queensland  Brisbane Australia    2360241         0.0974
## 4:    South Australia  Adelaide Australia    1324279         0.0550
## 5:           Tasmania    Hobart Australia     224462         0.0092
## 6:           Victoria Melbourne Australia    4725316         0.1924
## 7:  Western Australia     Perth Australia    2022044         0.0856
##    area_km2 percentage.area
## 1:   800641          0.1041
## 2:  1349129          0.1754
## 3:  1730647          0.2250
## 4:   983482          0.1279
## 5:    68401          0.0089
## 6:   227416          0.0296
## 7:  2529875          0.3289
netflixOrig <- fread("./RInputFiles/netflix_2017.csv", sep=",")
imdb <- fread("./RInputFiles/imdb_ratings.csv", sep=",")
netflix <- as.data.frame(netflixOrig)[, c("episodes", "release_date")]
rownames(netflix) <- netflixOrig$title

# Convert netflix to a data.table
netflix <- as.data.table(netflix, keep.rownames="series")

# Rename "title" to "series" in imdb
setnames(imdb, c("series", "rating"))

# Right join
imdb[netflix, on="series"]
##                            series rating episodes release_date
## 1: A Series of Unfortunate Events    7.9        8    13-Jan-17
## 2:                 13 Reasons Why    8.4       13    31-Mar-17
## 3:                          Gypsy    7.0       10    30-Jun-17
## 4:                          Ozark    8.5       10    21-Jul-17
## 5:                     Mindhunter    8.9       10    13-Oct-17
## 6:                       Longmire     NA       10    17-Nov-17
## 7:                        Godless     NA        6    22-Nov-17
## 8:                           Dark     NA       10     1-Dec-17
cardio <- fread("./RInputFiles/affymetrix_chd_genes.csv")
framingham <- fread("./RInputFiles/framingham_chd_genes.csv")
heart <- fread("./RInputFiles/illumina_chd_genes.csv")


# Try an inner join
merge(heart, cardio, by=c("gene"), allow.cartesian=TRUE)
##        gene   ilmn_probe change.x pvalue.x       affy_probe change.y
##  1:         ILMN_1772594     1.26  8.7e-06 PSR06035179.hg.1     1.16
##  2:         ILMN_1772594     1.26  8.7e-06 PSR01070675.hg.1     1.10
##  3:         ILMN_1772594     1.26  8.7e-06 JUC05011543.hg.1     1.09
##  4:         ILMN_1772594     1.26  8.7e-06 PSR08024619.hg.1     1.05
##  5:         ILMN_1772594     1.26  8.7e-06 PSR20002720.hg.1     1.04
##  6:         ILMN_3206475     1.16  3.3e-05 PSR06035179.hg.1     1.16
##  7:         ILMN_3206475     1.16  3.3e-05 PSR01070675.hg.1     1.10
##  8:         ILMN_3206475     1.16  3.3e-05 JUC05011543.hg.1     1.09
##  9:         ILMN_3206475     1.16  3.3e-05 PSR08024619.hg.1     1.05
## 10:         ILMN_3206475     1.16  3.3e-05 PSR20002720.hg.1     1.04
## 11:         ILMN_1689153     1.11  6.4e-05 PSR06035179.hg.1     1.16
## 12:         ILMN_1689153     1.11  6.4e-05 PSR01070675.hg.1     1.10
## 13:         ILMN_1689153     1.11  6.4e-05 JUC05011543.hg.1     1.09
## 14:         ILMN_1689153     1.11  6.4e-05 PSR08024619.hg.1     1.05
## 15:         ILMN_1689153     1.11  6.4e-05 PSR20002720.hg.1     1.04
## 16:         ILMN_3282983     1.10  7.3e-05 PSR06035179.hg.1     1.16
## 17:         ILMN_3282983     1.10  7.3e-05 PSR01070675.hg.1     1.10
## 18:         ILMN_3282983     1.10  7.3e-05 JUC05011543.hg.1     1.09
## 19:         ILMN_3282983     1.10  7.3e-05 PSR08024619.hg.1     1.05
## 20:         ILMN_3282983     1.10  7.3e-05 PSR20002720.hg.1     1.04
## 21:         ILMN_1708533    -1.07  1.1e-04 PSR06035179.hg.1     1.16
## 22:         ILMN_1708533    -1.07  1.1e-04 PSR01070675.hg.1     1.10
## 23:         ILMN_1708533    -1.07  1.1e-04 JUC05011543.hg.1     1.09
## 24:         ILMN_1708533    -1.07  1.1e-04 PSR08024619.hg.1     1.05
## 25:         ILMN_1708533    -1.07  1.1e-04 PSR20002720.hg.1     1.04
## 26:         ILMN_1723249     1.06  1.3e-04 PSR06035179.hg.1     1.16
## 27:         ILMN_1723249     1.06  1.3e-04 PSR01070675.hg.1     1.10
## 28:         ILMN_1723249     1.06  1.3e-04 JUC05011543.hg.1     1.09
## 29:         ILMN_1723249     1.06  1.3e-04 PSR08024619.hg.1     1.05
## 30:         ILMN_1723249     1.06  1.3e-04 PSR20002720.hg.1     1.04
## 31:         ILMN_3206111     1.06  1.4e-04 PSR06035179.hg.1     1.16
## 32:         ILMN_3206111     1.06  1.4e-04 PSR01070675.hg.1     1.10
## 33:         ILMN_3206111     1.06  1.4e-04 JUC05011543.hg.1     1.09
## 34:         ILMN_3206111     1.06  1.4e-04 PSR08024619.hg.1     1.05
## 35:         ILMN_3206111     1.06  1.4e-04 PSR20002720.hg.1     1.04
## 36:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031346.hg.1    -1.11
## 37:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031349.hg.1    -1.08
## 38:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031353.hg.1     1.07
## 39:  ABCB10 ILMN_1665730     1.05  1.4e-04 PSR01059893.hg.1     1.04
## 40:    CIB2 ILMN_1714489     1.19  2.4e-05 PSR15016581.hg.1    -1.05
## 41:   FABP4 ILMN_1773006    -1.09  9.2e-05 PSR08017600.hg.1     1.22
## 42:   FABP4 ILMN_1773006    -1.09  9.2e-05 PSR08017606.hg.1     1.20
## 43: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09002806.hg.1    -1.15
## 44: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09002801.hg.1    -1.11
## 45: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09015126.hg.1     1.07
## 46: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09002806.hg.1    -1.15
## 47: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09002801.hg.1    -1.11
## 48: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09015126.hg.1     1.07
## 49:    GYPB ILMN_1683093     1.15  4.1e-05 PSR04029411.hg.1     1.13
## 50:    GYPB ILMN_1683093     1.15  4.1e-05 PSR04029413.hg.1     1.12
## 51:    GZMB ILMN_2109489     1.10  8.0e-05 PSR14011271.hg.1     1.08
## 52:    ISCU ILMN_2409062     1.11  6.4e-05 JUC12005887.hg.1    -1.09
## 53:    ISCU ILMN_2409062     1.11  6.4e-05 JUC12005889.hg.1     1.05
## 54:    ISCU ILMN_1735432     1.08  1.0e-04 JUC12005887.hg.1    -1.09
## 55:    ISCU ILMN_1735432     1.08  1.0e-04 JUC12005889.hg.1     1.05
## 56:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 PSR10013017.hg.1    -1.18
## 57:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 JUC10007537.hg.1     1.12
## 58:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 JUC10007538.hg.1    -1.11
## 59:  SEPHS1 ILMN_1673369     1.09  8.2e-05 PSR10013017.hg.1    -1.18
## 60:  SEPHS1 ILMN_1673369     1.09  8.2e-05 JUC10007537.hg.1     1.12
## 61:  SEPHS1 ILMN_1673369     1.09  8.2e-05 JUC10007538.hg.1    -1.11
## 62:   TSTA3 ILMN_1697777     1.09  9.1e-05 JUC08011706.hg.1    -1.14
## 63:  ZDHHC2 ILMN_1769783     1.04  1.6e-04 JUC08000597.hg.1     1.12
##        gene   ilmn_probe change.x pvalue.x       affy_probe change.y
##     pvalue.y
##  1:  3.4e-05
##  2:  7.2e-05
##  3:  8.3e-05
##  4:  1.5e-04
##  5:  1.6e-04
##  6:  3.4e-05
##  7:  7.2e-05
##  8:  8.3e-05
##  9:  1.5e-04
## 10:  1.6e-04
## 11:  3.4e-05
## 12:  7.2e-05
## 13:  8.3e-05
## 14:  1.5e-04
## 15:  1.6e-04
## 16:  3.4e-05
## 17:  7.2e-05
## 18:  8.3e-05
## 19:  1.5e-04
## 20:  1.6e-04
## 21:  3.4e-05
## 22:  7.2e-05
## 23:  8.3e-05
## 24:  1.5e-04
## 25:  1.6e-04
## 26:  3.4e-05
## 27:  7.2e-05
## 28:  8.3e-05
## 29:  1.5e-04
## 30:  1.6e-04
## 31:  3.4e-05
## 32:  7.2e-05
## 33:  8.3e-05
## 34:  1.5e-04
## 35:  1.6e-04
## 36:  6.4e-05
## 37:  9.9e-05
## 38:  1.1e-04
## 39:  1.6e-04
## 40:  1.5e-04
## 41:  1.6e-05
## 42:  2.0e-05
## 43:  3.9e-05
## 44:  6.4e-05
## 45:  1.1e-04
## 46:  3.9e-05
## 47:  6.4e-05
## 48:  1.1e-04
## 49:  5.1e-05
## 50:  6.1e-05
## 51:  1.0e-04
## 52:  8.3e-05
## 53:  1.5e-04
## 54:  8.3e-05
## 55:  1.5e-04
## 56:  2.6e-05
## 57:  6.0e-05
## 58:  6.5e-05
## 59:  2.6e-05
## 60:  6.0e-05
## 61:  6.5e-05
## 62:  4.2e-05
## 63:  5.9e-05
##     pvalue.y
# Filter missing values
heart_2 <- heart[!is.na(gene)]
cardio_2 <- cardio[!is.na(gene)]

# Repeat the inner join
merge(heart_2, cardio_2, by=c("gene"), allow.cartesian=TRUE)
##        gene   ilmn_probe change.x pvalue.x       affy_probe change.y
##  1:         ILMN_1772594     1.26  8.7e-06 PSR06035179.hg.1     1.16
##  2:         ILMN_1772594     1.26  8.7e-06 PSR01070675.hg.1     1.10
##  3:         ILMN_1772594     1.26  8.7e-06 JUC05011543.hg.1     1.09
##  4:         ILMN_1772594     1.26  8.7e-06 PSR08024619.hg.1     1.05
##  5:         ILMN_1772594     1.26  8.7e-06 PSR20002720.hg.1     1.04
##  6:         ILMN_3206475     1.16  3.3e-05 PSR06035179.hg.1     1.16
##  7:         ILMN_3206475     1.16  3.3e-05 PSR01070675.hg.1     1.10
##  8:         ILMN_3206475     1.16  3.3e-05 JUC05011543.hg.1     1.09
##  9:         ILMN_3206475     1.16  3.3e-05 PSR08024619.hg.1     1.05
## 10:         ILMN_3206475     1.16  3.3e-05 PSR20002720.hg.1     1.04
## 11:         ILMN_1689153     1.11  6.4e-05 PSR06035179.hg.1     1.16
## 12:         ILMN_1689153     1.11  6.4e-05 PSR01070675.hg.1     1.10
## 13:         ILMN_1689153     1.11  6.4e-05 JUC05011543.hg.1     1.09
## 14:         ILMN_1689153     1.11  6.4e-05 PSR08024619.hg.1     1.05
## 15:         ILMN_1689153     1.11  6.4e-05 PSR20002720.hg.1     1.04
## 16:         ILMN_3282983     1.10  7.3e-05 PSR06035179.hg.1     1.16
## 17:         ILMN_3282983     1.10  7.3e-05 PSR01070675.hg.1     1.10
## 18:         ILMN_3282983     1.10  7.3e-05 JUC05011543.hg.1     1.09
## 19:         ILMN_3282983     1.10  7.3e-05 PSR08024619.hg.1     1.05
## 20:         ILMN_3282983     1.10  7.3e-05 PSR20002720.hg.1     1.04
## 21:         ILMN_1708533    -1.07  1.1e-04 PSR06035179.hg.1     1.16
## 22:         ILMN_1708533    -1.07  1.1e-04 PSR01070675.hg.1     1.10
## 23:         ILMN_1708533    -1.07  1.1e-04 JUC05011543.hg.1     1.09
## 24:         ILMN_1708533    -1.07  1.1e-04 PSR08024619.hg.1     1.05
## 25:         ILMN_1708533    -1.07  1.1e-04 PSR20002720.hg.1     1.04
## 26:         ILMN_1723249     1.06  1.3e-04 PSR06035179.hg.1     1.16
## 27:         ILMN_1723249     1.06  1.3e-04 PSR01070675.hg.1     1.10
## 28:         ILMN_1723249     1.06  1.3e-04 JUC05011543.hg.1     1.09
## 29:         ILMN_1723249     1.06  1.3e-04 PSR08024619.hg.1     1.05
## 30:         ILMN_1723249     1.06  1.3e-04 PSR20002720.hg.1     1.04
## 31:         ILMN_3206111     1.06  1.4e-04 PSR06035179.hg.1     1.16
## 32:         ILMN_3206111     1.06  1.4e-04 PSR01070675.hg.1     1.10
## 33:         ILMN_3206111     1.06  1.4e-04 JUC05011543.hg.1     1.09
## 34:         ILMN_3206111     1.06  1.4e-04 PSR08024619.hg.1     1.05
## 35:         ILMN_3206111     1.06  1.4e-04 PSR20002720.hg.1     1.04
## 36:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031346.hg.1    -1.11
## 37:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031349.hg.1    -1.08
## 38:  ABCB10 ILMN_1665730     1.05  1.4e-04 JUC01031353.hg.1     1.07
## 39:  ABCB10 ILMN_1665730     1.05  1.4e-04 PSR01059893.hg.1     1.04
## 40:    CIB2 ILMN_1714489     1.19  2.4e-05 PSR15016581.hg.1    -1.05
## 41:   FABP4 ILMN_1773006    -1.09  9.2e-05 PSR08017600.hg.1     1.22
## 42:   FABP4 ILMN_1773006    -1.09  9.2e-05 PSR08017606.hg.1     1.20
## 43: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09002806.hg.1    -1.15
## 44: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09002801.hg.1    -1.11
## 45: FOXD4L4 ILMN_3238576     1.10  7.6e-05 PSR09015126.hg.1     1.07
## 46: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09002806.hg.1    -1.15
## 47: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09002801.hg.1    -1.11
## 48: FOXD4L4 ILMN_3238576     1.06  1.2e-04 PSR09015126.hg.1     1.07
## 49:    GYPB ILMN_1683093     1.15  4.1e-05 PSR04029411.hg.1     1.13
## 50:    GYPB ILMN_1683093     1.15  4.1e-05 PSR04029413.hg.1     1.12
## 51:    GZMB ILMN_2109489     1.10  8.0e-05 PSR14011271.hg.1     1.08
## 52:    ISCU ILMN_2409062     1.11  6.4e-05 JUC12005887.hg.1    -1.09
## 53:    ISCU ILMN_2409062     1.11  6.4e-05 JUC12005889.hg.1     1.05
## 54:    ISCU ILMN_1735432     1.08  1.0e-04 JUC12005887.hg.1    -1.09
## 55:    ISCU ILMN_1735432     1.08  1.0e-04 JUC12005889.hg.1     1.05
## 56:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 PSR10013017.hg.1    -1.18
## 57:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 JUC10007537.hg.1     1.12
## 58:  SEPHS1 ILMN_2176768    -1.16  3.3e-05 JUC10007538.hg.1    -1.11
## 59:  SEPHS1 ILMN_1673369     1.09  8.2e-05 PSR10013017.hg.1    -1.18
## 60:  SEPHS1 ILMN_1673369     1.09  8.2e-05 JUC10007537.hg.1     1.12
## 61:  SEPHS1 ILMN_1673369     1.09  8.2e-05 JUC10007538.hg.1    -1.11
## 62:   TSTA3 ILMN_1697777     1.09  9.1e-05 JUC08011706.hg.1    -1.14
## 63:  ZDHHC2 ILMN_1769783     1.04  1.6e-04 JUC08000597.hg.1     1.12
##        gene   ilmn_probe change.x pvalue.x       affy_probe change.y
##     pvalue.y
##  1:  3.4e-05
##  2:  7.2e-05
##  3:  8.3e-05
##  4:  1.5e-04
##  5:  1.6e-04
##  6:  3.4e-05
##  7:  7.2e-05
##  8:  8.3e-05
##  9:  1.5e-04
## 10:  1.6e-04
## 11:  3.4e-05
## 12:  7.2e-05
## 13:  8.3e-05
## 14:  1.5e-04
## 15:  1.6e-04
## 16:  3.4e-05
## 17:  7.2e-05
## 18:  8.3e-05
## 19:  1.5e-04
## 20:  1.6e-04
## 21:  3.4e-05
## 22:  7.2e-05
## 23:  8.3e-05
## 24:  1.5e-04
## 25:  1.6e-04
## 26:  3.4e-05
## 27:  7.2e-05
## 28:  8.3e-05
## 29:  1.5e-04
## 30:  1.6e-04
## 31:  3.4e-05
## 32:  7.2e-05
## 33:  8.3e-05
## 34:  1.5e-04
## 35:  1.6e-04
## 36:  6.4e-05
## 37:  9.9e-05
## 38:  1.1e-04
## 39:  1.6e-04
## 40:  1.5e-04
## 41:  1.6e-05
## 42:  2.0e-05
## 43:  3.9e-05
## 44:  6.4e-05
## 45:  1.1e-04
## 46:  3.9e-05
## 47:  6.4e-05
## 48:  1.1e-04
## 49:  5.1e-05
## 50:  6.1e-05
## 51:  1.0e-04
## 52:  8.3e-05
## 53:  1.5e-04
## 54:  8.3e-05
## 55:  1.5e-04
## 56:  2.6e-05
## 57:  6.0e-05
## 58:  6.5e-05
## 59:  2.6e-05
## 60:  6.0e-05
## 61:  6.5e-05
## 62:  4.2e-05
## 63:  5.9e-05
##     pvalue.y
# Keep only the last probe for each gene
heart_3 <- unique(heart_2, by="gene", fromLast=TRUE)
cardio_3 <- unique(cardio_2, by="gene", fromLast=TRUE)

# Inner join
reproducible <- merge(heart_3, cardio_3, by="gene", suffixes=c(".heart", ".cardio"))
reproducible
##        gene   ilmn_probe change.heart pvalue.heart       affy_probe
##  1:         ILMN_3206111         1.06      1.4e-04 PSR20002720.hg.1
##  2:  ABCB10 ILMN_1665730         1.05      1.4e-04 PSR01059893.hg.1
##  3:    CIB2 ILMN_1714489         1.19      2.4e-05 PSR15016581.hg.1
##  4:   FABP4 ILMN_1773006        -1.09      9.2e-05 PSR08017606.hg.1
##  5: FOXD4L4 ILMN_3238576         1.06      1.2e-04 PSR09015126.hg.1
##  6:    GYPB ILMN_1683093         1.15      4.1e-05 PSR04029413.hg.1
##  7:    GZMB ILMN_2109489         1.10      8.0e-05 PSR14011271.hg.1
##  8:    ISCU ILMN_1735432         1.08      1.0e-04 JUC12005889.hg.1
##  9:  SEPHS1 ILMN_1673369         1.09      8.2e-05 JUC10007538.hg.1
## 10:   TSTA3 ILMN_1697777         1.09      9.1e-05 JUC08011706.hg.1
## 11:  ZDHHC2 ILMN_1769783         1.04      1.6e-04 JUC08000597.hg.1
##     change.cardio pvalue.cardio
##  1:          1.04       1.6e-04
##  2:          1.04       1.6e-04
##  3:         -1.05       1.5e-04
##  4:          1.20       2.0e-05
##  5:          1.07       1.1e-04
##  6:          1.12       6.1e-05
##  7:          1.08       1.0e-04
##  8:          1.05       1.5e-04
##  9:         -1.11       6.5e-05
## 10:         -1.14       4.2e-05
## 11:          1.12       5.9e-05
# Right join taking the first match
heart_2[framingham, on="gene", mult="first"]
##       ilmn_probe              gene change  pvalue i.change i.pvalue
##  1:         <NA>             SGIP1     NA      NA     1.06  8.1e-06
##  2: ILMN_2109489              GZMB   1.10 8.0e-05    -1.14  1.4e-05
##  3:         <NA>           SLC7A11     NA      NA     1.09  2.8e-05
##  4: ILMN_1725594           FAM188A  -1.18 2.8e-05     1.07  3.2e-05
##  5:         <NA> CCDC144B/CCDC144A     NA      NA     1.26  3.8e-05
##  6:         <NA>             TMTC2     NA      NA     1.07  4.7e-05
##  7: ILMN_1712400          SERPINB6   1.17 3.0e-05    -1.06  1.5e-04
##  8: ILMN_2344204             PRR13   1.14 4.6e-05     1.11  1.8e-04
##  9:         <NA>            TMEM56     NA      NA     1.12  2.3e-04
## 10:         <NA>          C20orf20     NA      NA     1.08  2.9e-04
## 11:         <NA>              GATS     NA      NA    -1.05  3.1e-04
## 12: ILMN_1665730            ABCB10   1.05 1.4e-04     1.09  3.5e-04
## 13: ILMN_2409062              ISCU   1.11 6.4e-05     1.08  3.6e-04
## 14:         <NA>             PAGE1     NA      NA     1.08  4.0e-04
## 15: ILMN_1769783            ZDHHC2   1.04 1.6e-04     1.10  4.0e-04
## 16: ILMN_1714489              CIB2   1.19 2.4e-05    -1.07  4.1e-04
## 17:         <NA>             CRYGB     NA      NA    -1.08  4.4e-04
## 18: ILMN_1773006             FABP4  -1.09 9.2e-05     1.09  4.7e-04
## 19:         <NA>              FIS1     NA      NA     1.11  5.5e-04
## 20: ILMN_1657680            CCDC69  -1.05 1.5e-04    -1.05  5.5e-04
## 21: ILMN_1683093              GYPB   1.15 4.1e-05     1.12  5.8e-04
## 22:         <NA>             DLEU2     NA      NA     1.09  6.6e-04
## 23:         <NA>             RBM24     NA      NA     1.05  6.9e-04
## 24:         <NA>               HBM     NA      NA     1.11  7.6e-04
## 25: ILMN_1697777             TSTA3   1.09 9.1e-05     1.11  7.7e-04
## 26: ILMN_1749930            TMEM48  -1.06 1.3e-04     1.06  8.0e-04
## 27:         <NA>         GABARAPL2     NA      NA     1.08  8.0e-04
## 28:         <NA>            MAPRE2     NA      NA    -1.05  8.1e-04
## 29: ILMN_1794306             USP28  -1.10 7.3e-05    -1.08  8.1e-04
## 30:         <NA>              DAD1     NA      NA     1.05  8.4e-04
## 31: ILMN_1793410             SNTB1   1.23 1.2e-05     1.04  8.5e-04
## 32:         <NA>             STX1B     NA      NA     1.07  8.6e-04
## 33:         <NA>             PRPF3     NA      NA    -1.05  8.8e-04
## 34:         <NA>              GUK1     NA      NA     1.12  9.1e-04
## 35:         <NA>             CHPT1     NA      NA     1.09  9.7e-04
##       ilmn_probe              gene change  pvalue i.change i.pvalue
# Anti-join
reproducible[!framingham, on="gene"]
##       gene   ilmn_probe change.heart pvalue.heart       affy_probe
## 1:         ILMN_3206111         1.06      1.4e-04 PSR20002720.hg.1
## 2: FOXD4L4 ILMN_3238576         1.06      1.2e-04 PSR09015126.hg.1
## 3:  SEPHS1 ILMN_1673369         1.09      8.2e-05 JUC10007538.hg.1
##    change.cardio pvalue.cardio
## 1:          1.04       1.6e-04
## 2:          1.07       1.1e-04
## 3:         -1.11       6.5e-05

Chapter 4 - Concatenating and Reshaping data.table

Concatenating two or more data.tables:

  • Can concatenate tables that have rows split across multiple tables (e.g., multiple file reads)
    • rbind(sales_2015, sales_2016)
    • rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year") # the new column year takes its values from the names given to the tables
    • rbind(sales_2015, sales_2016, idcol = TRUE) # the id column gets the default name .id
    • rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", fill = TRUE) # allows for misaligned columns, filled with NA as appropriate
  • Can also use rbindlist() for tables that are stored in a list
    • table_files <- c("sales_2015.csv", "sales_2016.csv")
    • list_of_tables <- lapply(table_files, fread)
    • rbindlist(list_of_tables)
    • rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", use.names = TRUE) # use.names matches columns by name
    • rbind("2015" = sales_2015, "2016" = sales_2016, idcol = "year", use.names = FALSE) # matches columns by ORDER rather than by name (name mismatch explicitly allowed)

Set operations:

  • Can run set operations on data.tables
    • fintersect(): which rows do these two data.tables share? Duplicate rows are ignored by default, but set all=TRUE to keep all the copies
    • fsetdiff(): which rows are unique to this data.table? Returns the rows in the first table that are not in the second table; set all=TRUE to keep duplicates from the first table
    • funion(): what is the unique set of rows across these two data.tables? Returns all the unique rows found in either table, with duplicates ignored by default; setting all=TRUE keeps everything, which is equivalent to rbind()
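The three set operations above can be sketched on a pair of toy tables (dt_a and dt_b are hypothetical data made up for illustration):

```r
library(data.table)

# Two small tables sharing some rows
dt_a <- data.table(id = c(1, 2, 3), value = c("x", "y", "z"))
dt_b <- data.table(id = c(2, 3, 4), value = c("y", "z", "w"))

fintersect(dt_a, dt_b)  # rows present in both tables: ids 2 and 3
fsetdiff(dt_a, dt_b)    # rows only in dt_a: id 1
funion(dt_a, dt_b)      # all unique rows across both tables: ids 1 through 4
```

Note that all three compare entire rows, not a key column, so tables must have the same column structure.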

Melting data.tables:

  • May want to melt a data.table from wide format to long format
    • melt(sales_wide, measure.vars = c("2015", "2016")) # the default names for the new columns are "variable" and "value"
    • melt(sales_wide, id.vars = "quarter", variable.name = "year", value.name = "amount") # give the new columns non-default names; when only id.vars is supplied, all remaining columns are treated as measure variables, but if both id.vars and measure.vars are supplied, any other columns are dropped
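A minimal melt sketch, using a hypothetical wide table of quarterly sales (sales_wide and its columns are made-up names):

```r
library(data.table)

# Hypothetical wide table: one row per quarter, one column per year
sales_wide <- data.table(quarter = c("Q1", "Q2"),
                         "2015" = c(100, 110),
                         "2016" = c(120, 130))

# Melt to long format: only id.vars is given, so the remaining
# columns ("2015", "2016") become the measure variables
sales_long <- melt(sales_wide, id.vars = "quarter",
                   variable.name = "year", value.name = "amount")
sales_long  # 4 rows, columns: quarter, year, amount
```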

Casting data.tables:

  • Can cast a long data.table to a wide format
    • sales_wide <- dcast(sales_long, quarter ~ year, value.var = "amount") # value.var names the column whose values fill the cells; in a ~ b, a defines the rows and b defines the new columns
    • dcast(profit_long, quarter ~ year, value.var = c("revenue", "profit")) # multiple value.var columns can be cast at once
    • dcast(sales_long, quarter ~ department + year, value.var = "amount") # by default, department and year are combined with an "_"
    • sales_wide <- dcast(sales_long, season ~ year, value.var = "amount")
    • mat <- as.matrix(sales_wide, rownames = "season") # convert to a matrix, using the season column as row names
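The cast direction can be sketched as the inverse of the melt above, again on made-up sales data (sales_long and its column names are hypothetical):

```r
library(data.table)

# Hypothetical long table: one row per quarter/year combination
sales_long <- data.table(quarter = rep(c("Q1", "Q2"), each = 2),
                         year = rep(c("2015", "2016"), times = 2),
                         amount = c(100, 120, 110, 130))

# quarter ~ year: quarters become rows, years become columns,
# and value.var fills the cells
sales_wide <- dcast(sales_long, quarter ~ year, value.var = "amount")
sales_wide

# data.table's as.matrix() method can promote a column to row names
mat <- as.matrix(sales_wide, rownames = "quarter")
mat["Q1", "2015"]  # 100
```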

Example code includes:

ebola_W50 <- fread("./RInputFiles/ebola_2014_W50.csv")
ebola_W51 <- fread("./RInputFiles/ebola_2014_W51.csv")
ebola_W52 <- fread("./RInputFiles/ebola_2014_W52.csv")


# Concatenate case numbers from weeks 50 and 51
rbind(ebola_W50, ebola_W51)
##        Location period_code     period_start       period_end Confirmed
##  1:     CONAKRY    2014-W50 08 December 2014 14 December 2014        37
##  2:       COYAH    2014-W50 08 December 2014 14 December 2014        31
##  3:      DABOLA    2014-W50 08 December 2014 14 December 2014         2
##  4:     DUBREKA    2014-W50 08 December 2014 14 December 2014         6
##  5:     FARANAH    2014-W50 08 December 2014 14 December 2014         0
##  6:  FORECARIAH    2014-W50 08 December 2014 14 December 2014        22
##  7:   GUECKEDOU    2014-W50 08 December 2014 14 December 2014         3
##  8:      KANKAN    2014-W50 08 December 2014 14 December 2014         2
##  9:    KEROUANE    2014-W50 08 December 2014 14 December 2014         9
## 10:      KINDIA    2014-W50 08 December 2014 14 December 2014         1
## 11: KISSIDOUGOU    2014-W50 08 December 2014 14 December 2014         2
## 12:   KOUROUSSA    2014-W50 08 December 2014 14 December 2014         3
## 13:        LOLA    2014-W50 08 December 2014 14 December 2014         4
## 14:     MACENTA    2014-W50 08 December 2014 14 December 2014         6
## 15:  N'ZEREKORE    2014-W50 08 December 2014 14 December 2014        13
## 16:     SIGUIRI    2014-W50 08 December 2014 14 December 2014         0
## 17:    TELIMELE    2014-W50 08 December 2014 14 December 2014        10
## 18:       BEYLA    2014-W51 15 December 2014 21 December 2014         3
## 19:     CONAKRY    2014-W51 15 December 2014 21 December 2014        39
## 20:       COYAH    2014-W51 15 December 2014 21 December 2014        35
## 21:      DABOLA    2014-W51 15 December 2014 21 December 2014         1
## 22:     DUBREKA    2014-W51 15 December 2014 21 December 2014         5
## 23:  FORECARIAH    2014-W51 15 December 2014 21 December 2014        15
## 24:        FRIA    2014-W51 15 December 2014 21 December 2014         1
## 25:   GUECKEDOU    2014-W51 15 December 2014 21 December 2014         7
## 26:      KANKAN    2014-W51 15 December 2014 21 December 2014         2
## 27:    KEROUANE    2014-W51 15 December 2014 21 December 2014        17
## 28:      KINDIA    2014-W51 15 December 2014 21 December 2014        21
## 29: KISSIDOUGOU    2014-W51 15 December 2014 21 December 2014       117
## 30:        LOLA    2014-W51 15 December 2014 21 December 2014        14
## 31:     MACENTA    2014-W51 15 December 2014 21 December 2014        24
## 32:  N'ZEREKORE    2014-W51 15 December 2014 21 December 2014        18
## 33:    TELIMELE    2014-W51 15 December 2014 21 December 2014         6
##        Location period_code     period_start       period_end Confirmed
##     Probable
##  1:        6
##  2:        5
##  3:        0
##  4:        3
##  5:       14
##  6:        1
##  7:        0
##  8:        0
##  9:        0
## 10:       24
## 11:        0
## 12:        0
## 13:        0
## 14:        0
## 15:        0
## 16:        4
## 17:        0
## 18:        0
## 19:        6
## 20:        0
## 21:        0
## 22:        1
## 23:        0
## 24:        0
## 25:        0
## 26:        0
## 27:        0
## 28:        6
## 29:       31
## 30:        1
## 31:        2
## 32:        0
## 33:        0
##     Probable
# Intentionally throws an error
# Concatenate case numbers from all three weeks
# rbind(ebola_W50, ebola_W51, ebola_W52)

# Modify the code
rbind(ebola_W50, ebola_W51, ebola_W52, fill=TRUE)
##        Location period_code     period_start       period_end Confirmed
##  1:     CONAKRY    2014-W50 08 December 2014 14 December 2014        37
##  2:       COYAH    2014-W50 08 December 2014 14 December 2014        31
##  3:      DABOLA    2014-W50 08 December 2014 14 December 2014         2
##  4:     DUBREKA    2014-W50 08 December 2014 14 December 2014         6
##  5:     FARANAH    2014-W50 08 December 2014 14 December 2014         0
##  6:  FORECARIAH    2014-W50 08 December 2014 14 December 2014        22
##  7:   GUECKEDOU    2014-W50 08 December 2014 14 December 2014         3
##  8:      KANKAN    2014-W50 08 December 2014 14 December 2014         2
##  9:    KEROUANE    2014-W50 08 December 2014 14 December 2014         9
## 10:      KINDIA    2014-W50 08 December 2014 14 December 2014         1
## 11: KISSIDOUGOU    2014-W50 08 December 2014 14 December 2014         2
## 12:   KOUROUSSA    2014-W50 08 December 2014 14 December 2014         3
## 13:        LOLA    2014-W50 08 December 2014 14 December 2014         4
## 14:     MACENTA    2014-W50 08 December 2014 14 December 2014         6
## 15:  N'ZEREKORE    2014-W50 08 December 2014 14 December 2014        13
## 16:     SIGUIRI    2014-W50 08 December 2014 14 December 2014         0
## 17:    TELIMELE    2014-W50 08 December 2014 14 December 2014        10
## 18:       BEYLA    2014-W51 15 December 2014 21 December 2014         3
## 19:     CONAKRY    2014-W51 15 December 2014 21 December 2014        39
## 20:       COYAH    2014-W51 15 December 2014 21 December 2014        35
## 21:      DABOLA    2014-W51 15 December 2014 21 December 2014         1
## 22:     DUBREKA    2014-W51 15 December 2014 21 December 2014         5
## 23:  FORECARIAH    2014-W51 15 December 2014 21 December 2014        15
## 24:        FRIA    2014-W51 15 December 2014 21 December 2014         1
## 25:   GUECKEDOU    2014-W51 15 December 2014 21 December 2014         7
## 26:      KANKAN    2014-W51 15 December 2014 21 December 2014         2
## 27:    KEROUANE    2014-W51 15 December 2014 21 December 2014        17
## 28:      KINDIA    2014-W51 15 December 2014 21 December 2014        21
## 29: KISSIDOUGOU    2014-W51 15 December 2014 21 December 2014       117
## 30:        LOLA    2014-W51 15 December 2014 21 December 2014        14
## 31:     MACENTA    2014-W51 15 December 2014 21 December 2014        24
## 32:  N'ZEREKORE    2014-W51 15 December 2014 21 December 2014        18
## 33:    TELIMELE    2014-W51 15 December 2014 21 December 2014         6
## 34:       BEYLA    2014-W52 22 December 2014 28 December 2014         1
## 35:     CONAKRY    2014-W52 22 December 2014 28 December 2014        46
## 36:       COYAH    2014-W52 22 December 2014 28 December 2014        23
## 37:      DABOLA    2014-W52 22 December 2014 28 December 2014         2
## 38:     DUBREKA    2014-W52 22 December 2014 28 December 2014        36
## 39:     FARANAH    2014-W52 22 December 2014 28 December 2014         4
## 40:  FORECARIAH    2014-W52 22 December 2014 28 December 2014         2
## 41:   GUECKEDOU    2014-W52 22 December 2014 28 December 2014         1
## 42:      KANKAN    2014-W52 22 December 2014 28 December 2014         3
## 43:    KEROUANE    2014-W52 22 December 2014 28 December 2014         4
## 44:      KINDIA    2014-W52 22 December 2014 28 December 2014        31
## 45: KISSIDOUGOU    2014-W52 22 December 2014 28 December 2014        26
## 46:   KOUROUSSA    2014-W52 22 December 2014 28 December 2014         1
## 47:        LOLA    2014-W52 22 December 2014 28 December 2014        17
## 48:     MACENTA    2014-W52 22 December 2014 28 December 2014         4
## 49:  N'ZEREKORE    2014-W52 22 December 2014 28 December 2014         5
## 50:     SIGUIRI    2014-W52 22 December 2014 28 December 2014         2
## 51:    TELIMELE    2014-W52 22 December 2014 28 December 2014        10
##        Location period_code     period_start       period_end Confirmed
##     Probable
##  1:        6
##  2:        5
##  3:        0
##  4:        3
##  5:       14
##  6:        1
##  7:        0
##  8:        0
##  9:        0
## 10:       24
## 11:        0
## 12:        0
## 13:        0
## 14:        0
## 15:        0
## 16:        4
## 17:        0
## 18:        0
## 19:        6
## 20:        0
## 21:        0
## 22:        1
## 23:        0
## 24:        0
## 25:        0
## 26:        0
## 27:        0
## 28:        6
## 29:       31
## 30:        1
## 31:        2
## 32:        0
## 33:        0
## 34:       NA
## 35:       NA
## 36:       NA
## 37:       NA
## 38:       NA
## 39:       NA
## 40:       NA
## 41:       NA
## 42:       NA
## 43:       NA
## 44:       NA
## 45:       NA
## 46:       NA
## 47:       NA
## 48:       NA
## 49:       NA
## 50:       NA
## 51:       NA
##     Probable
gdp_africa <- fread("./RInputFiles/gdp_africa_2000.csv")
gdp_asia <- fread("./RInputFiles/gdp_asia_2000.csv")
gdp_europe <- fread("./RInputFiles/gdp_europe_2000.csv")
gdp_north_america <- fread("./RInputFiles/gdp_north_america_2000.csv")
gdp_oceania <- fread("./RInputFiles/gdp_oceania_2000.csv")
gdp_south_america <- fread("./RInputFiles/gdp_south_america_2000.csv")

gdp <- list(africa=gdp_africa, asia=gdp_asia, europe=gdp_europe, 
            north_america=gdp_north_america, oceania=gdp_oceania, south_america=gdp_south_america
            )


# Concatenate its data.tables
gdp_all_1 <- rbindlist(gdp)

# Concatenate its data.tables
gdp_all_2 <- rbindlist(gdp, idcol="continent")
str(gdp_all_2)
## Classes 'data.table' and 'data.frame':   207 obs. of  6 variables:
##  $ continent     : chr  "africa" "africa" "africa" "africa" ...
##  $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ year          : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ gdp_per_capita: num  1794 298 346 3204 212 ...
##  $ population    : int  31183658 15058638 6949366 1736579 11607944 6767073 15927713 438737 3726048 8343321 ...
##  $ total_gdp     : num  5.60e+10 4.49e+09 2.40e+09 5.56e+09 2.47e+09 ...
##  - attr(*, ".internal.selfref")=<externalptr>
gdp_all_2[95:105]
##     continent              country year gdp_per_capita population
##  1:      asia         Turkmenistan 2000       645.2771    4501419
##  2:      asia United Arab Emirates 2000     34395.1491    3050128
##  3:      asia           Uzbekistan 2000       558.2211   24518222
##  4:      asia              Vietnam 2000       401.5478   80285563
##  5:      asia   West Bank and Gaza 2000      1369.1930    3223781
##  6:      asia                Yemen 2000       543.7139   17795219
##  7:    europe              Albania 2000      1200.1374    3121965
##  8:    europe              Andorra 2000     17539.4420      65399
##  9:    europe              Armenia 2000       621.4248    3076098
## 10:    europe              Austria 2000     23974.1831    8050884
## 11:    europe           Azerbaijan 2000       655.0974    8117742
##        total_gdp
##  1:   2904662605
##  2: 104909607426
##  3:  13686589934
##  4:  32238491363
##  5:   4413978466
##  6:   9675507618
##  7:   3746787074
##  8:   1147061967
##  9:   1911563665
## 10: 193013366891
## 11:   5317911943
# Fix the problem
gdp_all_3 <- rbindlist(gdp, idcol = "continent", use.names=TRUE)
gdp_all_3
##          continent      country year gdp_per_capita population
##   1:        africa      Algeria 2000      1794.4052   31183658
##   2:        africa       Angola 2000       298.4058   15058638
##   3:        africa        Benin 2000       345.9504    6949366
##   4:        africa     Botswana 2000      3204.1155    1736579
##   5:        africa Burkina Faso 2000       212.3754   11607944
##  ---                                                          
## 203: south_america     Paraguay 2000      1323.4819    5302703
## 204: south_america         Peru 2000      2060.5763   25914875
## 205: south_america     Suriname 2000      1911.0463     480751
## 206: south_america      Uruguay 2000      6914.3626    3321242
## 207: south_america    Venezuela 2000      4818.7082   24481477
##         total_gdp
##   1:  55956119099
##   2:   4493585565
##   3:   2404135636
##   4:   5564199661
##   5:   2465241592
##  ---             
## 203:   7018031622
## 204:  53399576128
## 205:    918737427
## 206:  22964271411
## 207: 117969093478
# Obtain countries in both Asia and Europe
fintersect(gdp$europe, gdp$asia)
##       country year gdp_per_capita population    total_gdp
## 1:    Armenia 2000       621.4248    3076098   1911563665
## 2: Azerbaijan 2000       655.0974    8117742   5317911943
## 3:     Cyprus 2000     13421.6554     943287  12660473076
## 4:    Georgia 2000       691.9977    4743591   3282554086
## 5: Kazakhstan 2000      1229.0010   14956769  18381883430
## 6:     Russia 2000      1775.1413  146400951 259882373162
## 7:     Turkey 2000      4189.4781   63240157 264943250389
# Concatenate all data tables
gdp_all <- rbindlist(gdp, use.names=TRUE)

# Find all countries that span multiple continents
gdp_all[duplicated(gdp_all)]
##       country year gdp_per_capita population    total_gdp
## 1:    Armenia 2000       621.4248    3076098   1911563665
## 2: Azerbaijan 2000       655.0974    8117742   5317911943
## 3:     Cyprus 2000     13421.6554     943287  12660473076
## 4:    Georgia 2000       691.9977    4743591   3282554086
## 5: Kazakhstan 2000      1229.0010   14956769  18381883430
## 6:     Russia 2000      1775.1413  146400951 259882373162
## 7:     Turkey 2000      4189.4781   63240157 264943250389
# Get all countries in either Asia or Europe
funion(gdp$europe, gdp$asia)
##                    country year gdp_per_capita population    total_gdp
##  1:                Albania 2000      1200.1374    3121965 3.746787e+09
##  2:                Andorra 2000     17539.4420      65399 1.147062e+09
##  3:                Armenia 2000       621.4248    3076098 1.911564e+09
##  4:                Austria 2000     23974.1831    8050884 1.930134e+11
##  5:             Azerbaijan 2000       655.0974    8117742 5.317912e+09
##  6:                Belarus 2000      1273.0491    9952055 1.266945e+10
##  7:                Belgium 2000     22697.0123   10268380 2.330615e+11
##  8: Bosnia and Herzegovina 2000      1490.6428    3792878 5.653826e+09
##  9:               Bulgaria 2000      1579.3482    8000510 1.263559e+10
## 10:        Channel Islands 2000     44310.9023     148725 6.590139e+09
## 11:                Croatia 2000      4861.6775    4428069 2.152784e+10
## 12:                 Cyprus 2000     13421.6554     943287 1.266047e+10
## 13:         Czech Republic 2000      5724.8375   10263010 5.875406e+10
## 14:                Denmark 2000     29980.1555    5338283 1.600426e+11
## 15:                Estonia 2000      4143.9272    1399145 5.797955e+09
## 16:         Faeroe Islands 2000     23224.0986      46491 1.079712e+09
## 17:                Finland 2000     23529.5385    5176482 1.218002e+11
## 18:                 France 2000     21774.9930   59387183 1.293155e+12
## 19:                Georgia 2000       691.9977    4743591 3.282554e+09
## 20:                Germany 2000     22945.7088   81895925 1.879160e+12
## 21:                 Greece 2000     11396.2326   10954032 1.248347e+11
## 22:                Hungary 2000      4542.7207   10224113 4.644529e+10
## 23:                Iceland 2000     30928.6756     281214 8.697577e+09
## 24:                Ireland 2000     25629.6501    3841574 9.845820e+10
## 25:            Isle of Man 2000     20359.4625      76806 1.563729e+09
## 26:                  Italy 2000     19388.2788   57147081 1.107984e+12
## 27:             Kazakhstan 2000      1229.0010   14956769 1.838188e+10
## 28:                 Kosovo 2000      1087.7624    2608347 2.837262e+09
## 29:                 Latvia 2000      3300.9347    2371481 7.828104e+09
## 30:          Liechtenstein 2000     75606.2032      33282 2.516326e+09
## 31:              Lithuania 2000      3267.3474    3486373 1.139119e+10
## 32:             Luxembourg 2000     46453.2458     436107 2.025859e+10
## 33:         Macedonia, FYR 2000      1785.3268    2012051 3.592169e+09
## 34:                  Malta 2000     10377.0373     387180 4.017781e+09
## 35:                Moldova 2000       354.0017    4201088 1.487192e+09
## 36:                 Monaco 2000     75382.4466      32081 2.418344e+09
## 37:             Montenegro 2000      1555.9125     613557 9.546410e+08
## 38:            Netherlands 2000     24179.7314   15894016 3.843130e+11
## 39:                 Norway 2000     37472.6717    4491572 1.683112e+11
## 40:                 Poland 2000      4454.0802   38486305 1.714211e+11
## 41:               Portugal 2000     11470.8979   10278542 1.179041e+11
## 42:                Romania 2000      1650.9684   22128139 3.653286e+10
## 43:                 Russia 2000      1775.1413  146400951 2.598824e+11
## 44:             San Marino 2000     28698.3217      27420 7.869080e+08
## 45:                 Serbia 2000       809.2751    9463306 7.658418e+09
## 46:        Slovak Republic 2000      5330.4016    5386065 2.870989e+10
## 47:               Slovenia 2000     10045.3601    1988652 1.997673e+10
## 48:                  Spain 2000     14413.7889   40749800 5.873590e+11
## 49:                 Sweden 2000     27869.3776    8872284 2.472650e+11
## 50:            Switzerland 2000     35639.4789    7165581 2.553776e+11
## 51:                 Turkey 2000      4189.4781   63240157 2.649433e+11
## 52:                Ukraine 2000       635.7090   48746269 3.098844e+10
## 53:         United Kingdom 2000     25057.6135   58867004 1.475067e+12
## 54:                Bahrain 2000     12489.4677     666855 8.328664e+09
## 55:             Bangladesh 2000       363.6399  131280739 4.773892e+10
## 56:                 Bhutan 2000       768.7510     564187 4.337193e+08
## 57:                 Brunei 2000     18350.1306     330554 6.065709e+09
## 58:               Cambodia 2000       293.5685   12197905 3.580920e+09
## 59:                  China 2000       949.1781 1269974572 1.205432e+12
## 60:       Hong Kong, China 2000     25756.6638    6783502 1.747204e+11
## 61:                  India 2000       450.4151 1053481072 4.745038e+11
## 62:              Indonesia 2000       773.3110  211540428 1.635865e+11
## 63:                   Iran 2000      1550.0906   65850062 1.020736e+11
## 64:                   Iraq 2000      1063.4815   23574751 2.507131e+10
## 65:                 Israel 2000     19859.3021    6013711 1.194281e+11
## 66:                  Japan 2000     37291.7062  125714674 4.688115e+12
## 67:                 Jordan 2000      1764.2299    4767476 8.410924e+09
## 68:                 Kuwait 2000     19434.4000    1929470 3.749809e+10
## 69:        Kyrgyz Republic 2000       279.6203    4954850 1.385477e+09
## 70:                    Lao 2000       325.5931    5342879 1.739605e+09
## 71:                Lebanon 2000      4612.1987    3235380 1.492222e+10
## 72:           Macao, China 2000     14128.8752     431907 6.102360e+09
## 73:               Malaysia 2000      4005.5563   23420751 9.381314e+10
## 74:               Maldives 2000      2284.9740     280384 6.406701e+08
## 75:               Mongolia 2000       471.4733    2397438 1.130328e+09
## 76:                  Nepal 2000       225.1687   23740145 5.345537e+09
## 77:                   Oman 2000      8774.9338    2239403 1.965061e+10
## 78:               Pakistan 2000       511.7026  138250487 7.074313e+10
## 79:            Philippines 2000      1048.0705   77932247 8.167849e+10
## 80:                  Qatar 2000     30052.7612     593453 1.783490e+10
## 81:           Saudi Arabia 2000      9400.8117   21392273 2.011047e+11
## 82:              Singapore 2000     23814.5566    3918183 9.330979e+10
## 83:            South Korea 2000     11346.6650   46206271 5.242871e+11
## 84:              Sri Lanka 2000       854.9267   18783745 1.605873e+10
## 85:                  Syria 2000      1208.7346   16354050 1.976771e+10
## 86:             Tajikistan 2000       139.4099    6186152 8.624107e+08
## 87:               Thailand 2000      1943.2379   62693322 1.218280e+11
## 88:            Timor-Leste 2000       380.9230     847185 3.227123e+08
## 89:           Turkmenistan 2000       645.2771    4501419 2.904663e+09
## 90:   United Arab Emirates 2000     34395.1491    3050128 1.049096e+11
## 91:             Uzbekistan 2000       558.2211   24518222 1.368659e+10
## 92:                Vietnam 2000       401.5478   80285563 3.223849e+10
## 93:     West Bank and Gaza 2000      1369.1930    3223781 4.413978e+09
## 94:                  Yemen 2000       543.7139   17795219 9.675508e+09
##                    country year gdp_per_capita population    total_gdp
# Concatenate the list of continent tables into one data table
gdp_all <- rbindlist(gdp, use.names = TRUE)

# Print the unique rows (countries listed under more than one continent appear once)
unique(gdp_all)
##           country year gdp_per_capita population    total_gdp
##   1:      Algeria 2000      1794.4052   31183658  55956119099
##   2:       Angola 2000       298.4058   15058638   4493585565
##   3:        Benin 2000       345.9504    6949366   2404135636
##   4:     Botswana 2000      3204.1155    1736579   5564199661
##   5: Burkina Faso 2000       212.3754   11607944   2465241592
##  ---                                                         
## 196:     Paraguay 2000      1323.4819    5302703   7018031622
## 197:         Peru 2000      2060.5763   25914875  53399576128
## 198:     Suriname 2000      1911.0463     480751    918737427
## 199:      Uruguay 2000      6914.3626    3321242  22964271411
## 200:    Venezuela 2000      4818.7082   24481477 117969093478
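A minimal sketch of `rbindlist()` on toy tables (the continent names and values here are hypothetical, not from the GDP data): `use.names` matches columns by name, `fill = TRUE` pads columns missing from one table with `NA`, and `idcol` records which list element each row came from.

```r
library(data.table)

africa <- data.table(country = c("Kenya", "Ghana"), total_gdp = c(1, 2))
asia   <- data.table(country = c("Japan", "India"), total_gdp = c(3, 4),
                     population = c(10, 20))

# Stack the tables; africa lacks a population column, so fill = TRUE
# pads it with NA, and idcol = "continent" tags each row's source
all_rows <- rbindlist(list(africa = africa, asia = asia),
                      use.names = TRUE, fill = TRUE, idcol = "continent")
all_rows
```

With `idcol` set, no separate continent column needs to be added by hand before concatenating.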
gdp_middle_east <- fread("./RInputFiles/gdp_middle_east_2000.csv")

# Which countries are in Africa but not considered part of the middle east?
fsetdiff(gdp$africa, gdp_middle_east)
##                      country year gdp_per_capita population    total_gdp
##  1:                  Algeria 2000     1794.40523   31183658  55956119099
##  2:                   Angola 2000      298.40584   15058638   4493585565
##  3:                    Benin 2000      345.95036    6949366   2404135636
##  4:                 Botswana 2000     3204.11548    1736579   5564199661
##  5:             Burkina Faso 2000      212.37539   11607944   2465241592
##  6:                  Burundi 2000      131.04633    6767073    886800109
##  7:                 Cameroon 2000      592.37200   15927713   9435131210
##  8:               Cape Verde 2000     1233.25752     438737    541075706
##  9: Central African Republic 2000      247.02818    3726048    920438872
## 10:                     Chad 2000      168.45000    8343321   1405432403
## 11:                  Comoros 2000      358.95291     547696    196597072
## 12:         Congo, Dem. Rep. 2000       86.75451   48048664   4168438144
## 13:              Congo, Rep. 2000     1026.83155    3109269   3192695522
## 14:            Cote d'Ivoire 2000      628.22811   16517948  10377039332
## 15:                 Djibouti 2000      753.11965     722562    544175637
## 16:        Equatorial Guinea 2000     2388.34863     530896   1267964732
## 17:                  Eritrea 2000      172.75716    3535156    610723497
## 18:                 Ethiopia 2000      123.68092   66443603   8217805735
## 19:                    Gabon 2000     4102.62460    1231548   5052579123
## 20:                   Gambia 2000      603.59535    1228863    741735988
## 21:                    Ghana 2000      259.99069   18824994   4894323262
## 22:                   Guinea 2000      358.96291    8799165   3158573866
## 23:            Guinea-Bissau 2000      173.66269    1315455    228445460
## 24:                    Kenya 2000      406.52306   31065820  12628972195
## 25:                  Lesotho 2000      379.77562    1856225    704949005
## 26:                  Liberia 2000      185.81340    2891968    537366418
## 27:                    Libya 2000     6479.71252    5337264  34583936374
## 28:               Madagascar 2000      252.38252   15744811   3973715025
## 29:                   Malawi 2000      155.27154   11193230   1737990110
## 30:                     Mali 2000      214.46659   11046926   2369196569
## 31:               Mauritania 2000      489.51164    2711421   1327272154
## 32:                Mauritius 2000     3861.03854    1185143   4575882801
## 33:                  Morocco 2000     1271.81109   28950553  36819634425
## 34:               Mozambique 2000      233.43927   18264536   4263659870
## 35:                  Namibia 2000     2061.62097    1897953   3912859707
## 36:                    Niger 2000      164.64895   11224523   1848105899
## 37:                  Nigeria 2000      371.76808  122876723  45681643436
## 38:                   Rwanda 2000      214.23370    8021875   1718555953
## 39:    Sao Tome and Principe 2000      543.99904     137164     74617084
## 40:                  Senegal 2000      492.28623    9860578   4854226722
## 41:               Seychelles 2000     7578.85105      81154    615054078
## 42:             Sierra Leone 2000      153.47797    4060709    623229365
## 43:             South Africa 2000     3019.94637   44896856 135586097167
## 44:                    Sudan 2000      358.52920   28079664  10067379498
## 45:                Swaziland 2000     1508.18366    1063715   1604277582
## 46:                 Tanzania 2000      307.98615   33991590  10468939093
## 47:                     Togo 2000      270.00088    4874735   1316182721
## 48:                  Tunisia 2000     2245.33506    9699192  21777935842
## 49:                   Uganda 2000      255.78061   23757636   6076742657
## 50:                   Zambia 2000      318.92682   10585220   3375910577
## 51:                 Zimbabwe 2000      534.79115   12499981   6684879233
##                      country year gdp_per_capita population    total_gdp
# Which countries are in Asia but not considered part of the middle east?
fsetdiff(gdp$asia, gdp_middle_east)
##              country year gdp_per_capita population    total_gdp
##  1:          Armenia 2000       621.4248    3076098 1.911564e+09
##  2:       Azerbaijan 2000       655.0974    8117742 5.317912e+09
##  3:       Bangladesh 2000       363.6399  131280739 4.773892e+10
##  4:           Bhutan 2000       768.7510     564187 4.337193e+08
##  5:           Brunei 2000     18350.1306     330554 6.065709e+09
##  6:         Cambodia 2000       293.5685   12197905 3.580920e+09
##  7:            China 2000       949.1781 1269974572 1.205432e+12
##  8:          Georgia 2000       691.9977    4743591 3.282554e+09
##  9: Hong Kong, China 2000     25756.6638    6783502 1.747204e+11
## 10:            India 2000       450.4151 1053481072 4.745038e+11
## 11:        Indonesia 2000       773.3110  211540428 1.635865e+11
## 12:            Japan 2000     37291.7062  125714674 4.688115e+12
## 13:       Kazakhstan 2000      1229.0010   14956769 1.838188e+10
## 14:  Kyrgyz Republic 2000       279.6203    4954850 1.385477e+09
## 15:              Lao 2000       325.5931    5342879 1.739605e+09
## 16:     Macao, China 2000     14128.8752     431907 6.102360e+09
## 17:         Malaysia 2000      4005.5563   23420751 9.381314e+10
## 18:         Maldives 2000      2284.9740     280384 6.406701e+08
## 19:         Mongolia 2000       471.4733    2397438 1.130328e+09
## 20:            Nepal 2000       225.1687   23740145 5.345537e+09
## 21:         Pakistan 2000       511.7026  138250487 7.074313e+10
## 22:      Philippines 2000      1048.0705   77932247 8.167849e+10
## 23:           Russia 2000      1775.1413  146400951 2.598824e+11
## 24:        Singapore 2000     23814.5566    3918183 9.330979e+10
## 25:      South Korea 2000     11346.6650   46206271 5.242871e+11
## 26:        Sri Lanka 2000       854.9267   18783745 1.605873e+10
## 27:       Tajikistan 2000       139.4099    6186152 8.624107e+08
## 28:         Thailand 2000      1943.2379   62693322 1.218280e+11
## 29:      Timor-Leste 2000       380.9230     847185 3.227123e+08
## 30:     Turkmenistan 2000       645.2771    4501419 2.904663e+09
## 31:       Uzbekistan 2000       558.2211   24518222 1.368659e+10
## 32:          Vietnam 2000       401.5478   80285563 3.223849e+10
##              country year gdp_per_capita population    total_gdp
# Which countries are in Europe but not considered part of the middle east?
fsetdiff(gdp$europe, gdp_middle_east)
##                    country year gdp_per_capita population    total_gdp
##  1:                Albania 2000      1200.1374    3121965 3.746787e+09
##  2:                Andorra 2000     17539.4420      65399 1.147062e+09
##  3:                Armenia 2000       621.4248    3076098 1.911564e+09
##  4:                Austria 2000     23974.1831    8050884 1.930134e+11
##  5:             Azerbaijan 2000       655.0974    8117742 5.317912e+09
##  6:                Belarus 2000      1273.0491    9952055 1.266945e+10
##  7:                Belgium 2000     22697.0123   10268380 2.330615e+11
##  8: Bosnia and Herzegovina 2000      1490.6428    3792878 5.653826e+09
##  9:               Bulgaria 2000      1579.3482    8000510 1.263559e+10
## 10:        Channel Islands 2000     44310.9023     148725 6.590139e+09
## 11:                Croatia 2000      4861.6775    4428069 2.152784e+10
## 12:         Czech Republic 2000      5724.8375   10263010 5.875406e+10
## 13:                Denmark 2000     29980.1555    5338283 1.600426e+11
## 14:                Estonia 2000      4143.9272    1399145 5.797955e+09
## 15:         Faeroe Islands 2000     23224.0986      46491 1.079712e+09
## 16:                Finland 2000     23529.5385    5176482 1.218002e+11
## 17:                 France 2000     21774.9930   59387183 1.293155e+12
## 18:                Georgia 2000       691.9977    4743591 3.282554e+09
## 19:                Germany 2000     22945.7088   81895925 1.879160e+12
## 20:                 Greece 2000     11396.2326   10954032 1.248347e+11
## 21:                Hungary 2000      4542.7207   10224113 4.644529e+10
## 22:                Iceland 2000     30928.6756     281214 8.697577e+09
## 23:                Ireland 2000     25629.6501    3841574 9.845820e+10
## 24:            Isle of Man 2000     20359.4625      76806 1.563729e+09
## 25:                  Italy 2000     19388.2788   57147081 1.107984e+12
## 26:             Kazakhstan 2000      1229.0010   14956769 1.838188e+10
## 27:                 Kosovo 2000      1087.7624    2608347 2.837262e+09
## 28:                 Latvia 2000      3300.9347    2371481 7.828104e+09
## 29:          Liechtenstein 2000     75606.2032      33282 2.516326e+09
## 30:              Lithuania 2000      3267.3474    3486373 1.139119e+10
## 31:             Luxembourg 2000     46453.2458     436107 2.025859e+10
## 32:         Macedonia, FYR 2000      1785.3268    2012051 3.592169e+09
## 33:                  Malta 2000     10377.0373     387180 4.017781e+09
## 34:                Moldova 2000       354.0017    4201088 1.487192e+09
## 35:                 Monaco 2000     75382.4466      32081 2.418344e+09
## 36:             Montenegro 2000      1555.9125     613557 9.546410e+08
## 37:            Netherlands 2000     24179.7314   15894016 3.843130e+11
## 38:                 Norway 2000     37472.6717    4491572 1.683112e+11
## 39:                 Poland 2000      4454.0802   38486305 1.714211e+11
## 40:               Portugal 2000     11470.8979   10278542 1.179041e+11
## 41:                Romania 2000      1650.9684   22128139 3.653286e+10
## 42:                 Russia 2000      1775.1413  146400951 2.598824e+11
## 43:             San Marino 2000     28698.3217      27420 7.869080e+08
## 44:                 Serbia 2000       809.2751    9463306 7.658418e+09
## 45:        Slovak Republic 2000      5330.4016    5386065 2.870989e+10
## 46:               Slovenia 2000     10045.3601    1988652 1.997673e+10
## 47:                  Spain 2000     14413.7889   40749800 5.873590e+11
## 48:                 Sweden 2000     27869.3776    8872284 2.472650e+11
## 49:            Switzerland 2000     35639.4789    7165581 2.553776e+11
## 50:                Ukraine 2000       635.7090   48746269 3.098844e+10
## 51:         United Kingdom 2000     25057.6135   58867004 1.475067e+12
##                    country year gdp_per_capita population    total_gdp
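`fsetdiff()` is one of data.table's row-wise set operations; all of them require the two tables to have the same column names and types. A sketch on toy one-column tables (the country lists are illustrative only):

```r
library(data.table)

x <- data.table(country = c("Egypt", "Kenya", "Turkey"))
y <- data.table(country = c("Turkey", "Jordan"))

fsetdiff(x, y)    # rows in x but not in y: Egypt, Kenya
fintersect(x, y)  # rows present in both:  Turkey
funion(x, y)      # all distinct rows:     Egypt, Kenya, Turkey, Jordan
```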
gdp_per_capita_wrong <- fread("./RInputFiles/gdp_per_capita_oceania.csv")

# The file is transposed: each country is a row and the first data row holds
# the years, so the labels in V1 become the corrected table's column names
colNames <- gdp_per_capita_wrong$V1
colNames[1] <- "year"

# Transpose the numeric block (everything except the label column)
numData <- t(gdp_per_capita_wrong[, -1])

# Rebuild a data table with one column per country, plus a year column
gdp_per_capita <- as.data.table(numData)
colnames(gdp_per_capita) <- colNames


# Print gdp_per_capita
gdp_per_capita
##    year Australia     Fiji French Polynesia Kiribati Marshall Islands
## 1: 1990  17553.38 1833.184         14003.26 670.2639         2137.690
## 2: 1995  18690.44 1955.294         13608.05 661.5434         2581.130
## 3: 2000  21708.04 2074.747         14507.54 812.2821         2127.485
## 4: 2005  23929.16 2308.158               NA 785.3737         2368.149
## 5: 2010  25190.84 2218.147               NA 713.5625         2437.282
##    Micronesia, Fed. Sts. New Caledonia New Zealand    Palau
## 1:              1894.105      13562.65    11627.11       NA
## 2:              2189.552      13601.46    12291.84 6355.627
## 3:              2177.589      12579.60    13375.78 6251.982
## 4:              2196.247            NA    15171.59 6429.586
## 5:              2134.037            NA    14629.22 5756.811
##    Papua New Guinea    Samoa Solomon Islands    Tonga   Tuvalu  Vanuatu
## 1:         566.8707 1191.783       1101.2480 1493.743 1128.078 1327.357
## 2:         754.7247 1204.353       1403.8032 1785.211 1278.696 1358.466
## 3:         654.6688 1391.214       1064.5147 1926.005 1458.950 1469.762
## 4:         626.3082 1742.154        975.6054 2115.027 1426.191 1355.822
## 5:         744.2105 1769.565       1143.7875 2069.226 1559.984 1522.384
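The base-R `t()` route above coerces the table through a matrix; data.table's own `transpose()` can do the same fix more directly. A sketch on a two-country toy table (assumes a recent data.table, since the `make.names` and `keep.names` arguments are newer additions): `make.names` takes the new column names from an existing column, and `keep.names` stores the old column names (the years) in a new column.

```r
library(data.table)

wide_wrong <- data.table(V1 = c("Australia", "Fiji"),
                         `1990` = c(17553.38, 1833.18),
                         `1995` = c(18690.44, 1955.29))

# Countries become columns; the old year headers become a "year" column
fixed <- transpose(wide_wrong, make.names = "V1", keep.names = "year")
fixed
```

Note that the `year` column comes back as character (it held column names), so it may need `as.integer()` before plotting or joining.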
# Reshape gdp_per_capita to the long format
melt(gdp_per_capita, id.vars="year")
##     year              variable      value
##  1: 1990             Australia 17553.3768
##  2: 1995             Australia 18690.4366
##  3: 2000             Australia 21708.0373
##  4: 2005             Australia 23929.1644
##  5: 2010             Australia 25190.8399
##  6: 1990                  Fiji  1833.1844
##  7: 1995                  Fiji  1955.2939
##  8: 2000                  Fiji  2074.7473
##  9: 2005                  Fiji  2308.1583
## 10: 2010                  Fiji  2218.1470
## 11: 1990      French Polynesia 14003.2649
## 12: 1995      French Polynesia 13608.0520
## 13: 2000      French Polynesia 14507.5415
## 14: 2005      French Polynesia         NA
## 15: 2010      French Polynesia         NA
## 16: 1990              Kiribati   670.2639
## 17: 1995              Kiribati   661.5434
## 18: 2000              Kiribati   812.2821
## 19: 2005              Kiribati   785.3737
## 20: 2010              Kiribati   713.5625
## 21: 1990      Marshall Islands  2137.6897
## 22: 1995      Marshall Islands  2581.1304
## 23: 2000      Marshall Islands  2127.4855
## 24: 2005      Marshall Islands  2368.1488
## 25: 2010      Marshall Islands  2437.2824
## 26: 1990 Micronesia, Fed. Sts.  1894.1055
## 27: 1995 Micronesia, Fed. Sts.  2189.5520
## 28: 2000 Micronesia, Fed. Sts.  2177.5891
## 29: 2005 Micronesia, Fed. Sts.  2196.2472
## 30: 2010 Micronesia, Fed. Sts.  2134.0372
## 31: 1990         New Caledonia 13562.6526
## 32: 1995         New Caledonia 13601.4555
## 33: 2000         New Caledonia 12579.5951
## 34: 2005         New Caledonia         NA
## 35: 2010         New Caledonia         NA
## 36: 1990           New Zealand 11627.1065
## 37: 1995           New Zealand 12291.8376
## 38: 2000           New Zealand 13375.7805
## 39: 2005           New Zealand 15171.5943
## 40: 2010           New Zealand 14629.2181
## 41: 1990                 Palau         NA
## 42: 1995                 Palau  6355.6270
## 43: 2000                 Palau  6251.9821
## 44: 2005                 Palau  6429.5859
## 45: 2010                 Palau  5756.8105
## 46: 1990      Papua New Guinea   566.8707
## 47: 1995      Papua New Guinea   754.7247
## 48: 2000      Papua New Guinea   654.6688
## 49: 2005      Papua New Guinea   626.3082
## 50: 2010      Papua New Guinea   744.2105
## 51: 1990                 Samoa  1191.7833
## 52: 1995                 Samoa  1204.3528
## 53: 2000                 Samoa  1391.2144
## 54: 2005                 Samoa  1742.1545
## 55: 2010                 Samoa  1769.5650
## 56: 1990       Solomon Islands  1101.2480
## 57: 1995       Solomon Islands  1403.8032
## 58: 2000       Solomon Islands  1064.5147
## 59: 2005       Solomon Islands   975.6054
## 60: 2010       Solomon Islands  1143.7875
## 61: 1990                 Tonga  1493.7433
## 62: 1995                 Tonga  1785.2106
## 63: 2000                 Tonga  1926.0046
## 64: 2005                 Tonga  2115.0271
## 65: 2010                 Tonga  2069.2262
## 66: 1990                Tuvalu  1128.0776
## 67: 1995                Tuvalu  1278.6963
## 68: 2000                Tuvalu  1458.9496
## 69: 2005                Tuvalu  1426.1913
## 70: 2010                Tuvalu  1559.9837
## 71: 1990               Vanuatu  1327.3566
## 72: 1995               Vanuatu  1358.4658
## 73: 2000               Vanuatu  1469.7618
## 74: 2005               Vanuatu  1355.8217
## 75: 2010               Vanuatu  1522.3840
##     year              variable      value
# Name the new columns "country" and "gdp_pc" while melting
melt(gdp_per_capita, id.vars = "year", variable.name="country", value.name="gdp_pc")
##     year               country     gdp_pc
##  1: 1990             Australia 17553.3768
##  2: 1995             Australia 18690.4366
##  3: 2000             Australia 21708.0373
##  4: 2005             Australia 23929.1644
##  5: 2010             Australia 25190.8399
##  6: 1990                  Fiji  1833.1844
##  7: 1995                  Fiji  1955.2939
##  8: 2000                  Fiji  2074.7473
##  9: 2005                  Fiji  2308.1583
## 10: 2010                  Fiji  2218.1470
## 11: 1990      French Polynesia 14003.2649
## 12: 1995      French Polynesia 13608.0520
## 13: 2000      French Polynesia 14507.5415
## 14: 2005      French Polynesia         NA
## 15: 2010      French Polynesia         NA
## 16: 1990              Kiribati   670.2639
## 17: 1995              Kiribati   661.5434
## 18: 2000              Kiribati   812.2821
## 19: 2005              Kiribati   785.3737
## 20: 2010              Kiribati   713.5625
## 21: 1990      Marshall Islands  2137.6897
## 22: 1995      Marshall Islands  2581.1304
## 23: 2000      Marshall Islands  2127.4855
## 24: 2005      Marshall Islands  2368.1488
## 25: 2010      Marshall Islands  2437.2824
## 26: 1990 Micronesia, Fed. Sts.  1894.1055
## 27: 1995 Micronesia, Fed. Sts.  2189.5520
## 28: 2000 Micronesia, Fed. Sts.  2177.5891
## 29: 2005 Micronesia, Fed. Sts.  2196.2472
## 30: 2010 Micronesia, Fed. Sts.  2134.0372
## 31: 1990         New Caledonia 13562.6526
## 32: 1995         New Caledonia 13601.4555
## 33: 2000         New Caledonia 12579.5951
## 34: 2005         New Caledonia         NA
## 35: 2010         New Caledonia         NA
## 36: 1990           New Zealand 11627.1065
## 37: 1995           New Zealand 12291.8376
## 38: 2000           New Zealand 13375.7805
## 39: 2005           New Zealand 15171.5943
## 40: 2010           New Zealand 14629.2181
## 41: 1990                 Palau         NA
## 42: 1995                 Palau  6355.6270
## 43: 2000                 Palau  6251.9821
## 44: 2005                 Palau  6429.5859
## 45: 2010                 Palau  5756.8105
## 46: 1990      Papua New Guinea   566.8707
## 47: 1995      Papua New Guinea   754.7247
## 48: 2000      Papua New Guinea   654.6688
## 49: 2005      Papua New Guinea   626.3082
## 50: 2010      Papua New Guinea   744.2105
## 51: 1990                 Samoa  1191.7833
## 52: 1995                 Samoa  1204.3528
## 53: 2000                 Samoa  1391.2144
## 54: 2005                 Samoa  1742.1545
## 55: 2010                 Samoa  1769.5650
## 56: 1990       Solomon Islands  1101.2480
## 57: 1995       Solomon Islands  1403.8032
## 58: 2000       Solomon Islands  1064.5147
## 59: 2005       Solomon Islands   975.6054
## 60: 2010       Solomon Islands  1143.7875
## 61: 1990                 Tonga  1493.7433
## 62: 1995                 Tonga  1785.2106
## 63: 2000                 Tonga  1926.0046
## 64: 2005                 Tonga  2115.0271
## 65: 2010                 Tonga  2069.2262
## 66: 1990                Tuvalu  1128.0776
## 67: 1995                Tuvalu  1278.6963
## 68: 2000                Tuvalu  1458.9496
## 69: 2005                Tuvalu  1426.1913
## 70: 2010                Tuvalu  1559.9837
## 71: 1990               Vanuatu  1327.3566
## 72: 1995               Vanuatu  1358.4658
## 73: 2000               Vanuatu  1469.7618
## 74: 2005               Vanuatu  1355.8217
## 75: 2010               Vanuatu  1522.3840
##     year               country     gdp_pc
# Build ebola_wide (one confirmed-case column per reporting week), then print it
ebola_wide <- rbind(ebola_W50, ebola_W51) %>% 
    mutate(Week_50=ifelse(period_code=="2014-W50", Confirmed, NA), 
           Week_51=ifelse(period_code=="2014-W51", Confirmed, NA)
           ) %>%
    select(Location, period_start, period_end, Week_50, Week_51) %>%
    arrange(Location, period_start)
ebola_wide
##       Location     period_start       period_end Week_50 Week_51
## 1        BEYLA 15 December 2014 21 December 2014      NA       3
## 2      CONAKRY 08 December 2014 14 December 2014      37      NA
## 3      CONAKRY 15 December 2014 21 December 2014      NA      39
## 4        COYAH 08 December 2014 14 December 2014      31      NA
## 5        COYAH 15 December 2014 21 December 2014      NA      35
## 6       DABOLA 08 December 2014 14 December 2014       2      NA
## 7       DABOLA 15 December 2014 21 December 2014      NA       1
## 8      DUBREKA 08 December 2014 14 December 2014       6      NA
## 9      DUBREKA 15 December 2014 21 December 2014      NA       5
## 10     FARANAH 08 December 2014 14 December 2014       0      NA
## 11  FORECARIAH 08 December 2014 14 December 2014      22      NA
## 12  FORECARIAH 15 December 2014 21 December 2014      NA      15
## 13        FRIA 15 December 2014 21 December 2014      NA       1
## 14   GUECKEDOU 08 December 2014 14 December 2014       3      NA
## 15   GUECKEDOU 15 December 2014 21 December 2014      NA       7
## 16      KANKAN 08 December 2014 14 December 2014       2      NA
## 17      KANKAN 15 December 2014 21 December 2014      NA       2
## 18    KEROUANE 08 December 2014 14 December 2014       9      NA
## 19    KEROUANE 15 December 2014 21 December 2014      NA      17
## 20      KINDIA 08 December 2014 14 December 2014       1      NA
## 21      KINDIA 15 December 2014 21 December 2014      NA      21
## 22 KISSIDOUGOU 08 December 2014 14 December 2014       2      NA
## 23 KISSIDOUGOU 15 December 2014 21 December 2014      NA     117
## 24   KOUROUSSA 08 December 2014 14 December 2014       3      NA
## 25        LOLA 08 December 2014 14 December 2014       4      NA
## 26        LOLA 15 December 2014 21 December 2014      NA      14
## 27     MACENTA 08 December 2014 14 December 2014       6      NA
## 28     MACENTA 15 December 2014 21 December 2014      NA      24
## 29  N'ZEREKORE 08 December 2014 14 December 2014      13      NA
## 30  N'ZEREKORE 15 December 2014 21 December 2014      NA      18
## 31     SIGUIRI 08 December 2014 14 December 2014       0      NA
## 32    TELIMELE 08 December 2014 14 December 2014      10      NA
## 33    TELIMELE 15 December 2014 21 December 2014      NA       6
# Stack Week_50 and Week_51
melt(ebola_wide, measure.vars=c("Week_50", "Week_51"), variable.name="period", value.name="cases")
##       Location     period_start       period_end  period cases
## 1        BEYLA 15 December 2014 21 December 2014 Week_50    NA
## 2      CONAKRY 08 December 2014 14 December 2014 Week_50    37
## 3      CONAKRY 15 December 2014 21 December 2014 Week_50    NA
## 4        COYAH 08 December 2014 14 December 2014 Week_50    31
## 5        COYAH 15 December 2014 21 December 2014 Week_50    NA
## 6       DABOLA 08 December 2014 14 December 2014 Week_50     2
## 7       DABOLA 15 December 2014 21 December 2014 Week_50    NA
## 8      DUBREKA 08 December 2014 14 December 2014 Week_50     6
## 9      DUBREKA 15 December 2014 21 December 2014 Week_50    NA
## 10     FARANAH 08 December 2014 14 December 2014 Week_50     0
## 11  FORECARIAH 08 December 2014 14 December 2014 Week_50    22
## 12  FORECARIAH 15 December 2014 21 December 2014 Week_50    NA
## 13        FRIA 15 December 2014 21 December 2014 Week_50    NA
## 14   GUECKEDOU 08 December 2014 14 December 2014 Week_50     3
## 15   GUECKEDOU 15 December 2014 21 December 2014 Week_50    NA
## 16      KANKAN 08 December 2014 14 December 2014 Week_50     2
## 17      KANKAN 15 December 2014 21 December 2014 Week_50    NA
## 18    KEROUANE 08 December 2014 14 December 2014 Week_50     9
## 19    KEROUANE 15 December 2014 21 December 2014 Week_50    NA
## 20      KINDIA 08 December 2014 14 December 2014 Week_50     1
## 21      KINDIA 15 December 2014 21 December 2014 Week_50    NA
## 22 KISSIDOUGOU 08 December 2014 14 December 2014 Week_50     2
## 23 KISSIDOUGOU 15 December 2014 21 December 2014 Week_50    NA
## 24   KOUROUSSA 08 December 2014 14 December 2014 Week_50     3
## 25        LOLA 08 December 2014 14 December 2014 Week_50     4
## 26        LOLA 15 December 2014 21 December 2014 Week_50    NA
## 27     MACENTA 08 December 2014 14 December 2014 Week_50     6
## 28     MACENTA 15 December 2014 21 December 2014 Week_50    NA
## 29  N'ZEREKORE 08 December 2014 14 December 2014 Week_50    13
## 30  N'ZEREKORE 15 December 2014 21 December 2014 Week_50    NA
## 31     SIGUIRI 08 December 2014 14 December 2014 Week_50     0
## 32    TELIMELE 08 December 2014 14 December 2014 Week_50    10
## 33    TELIMELE 15 December 2014 21 December 2014 Week_50    NA
## 34       BEYLA 15 December 2014 21 December 2014 Week_51     3
## 35     CONAKRY 08 December 2014 14 December 2014 Week_51    NA
## 36     CONAKRY 15 December 2014 21 December 2014 Week_51    39
## 37       COYAH 08 December 2014 14 December 2014 Week_51    NA
## 38       COYAH 15 December 2014 21 December 2014 Week_51    35
## 39      DABOLA 08 December 2014 14 December 2014 Week_51    NA
## 40      DABOLA 15 December 2014 21 December 2014 Week_51     1
## 41     DUBREKA 08 December 2014 14 December 2014 Week_51    NA
## 42     DUBREKA 15 December 2014 21 December 2014 Week_51     5
## 43     FARANAH 08 December 2014 14 December 2014 Week_51    NA
## 44  FORECARIAH 08 December 2014 14 December 2014 Week_51    NA
## 45  FORECARIAH 15 December 2014 21 December 2014 Week_51    15
## 46        FRIA 15 December 2014 21 December 2014 Week_51     1
## 47   GUECKEDOU 08 December 2014 14 December 2014 Week_51    NA
## 48   GUECKEDOU 15 December 2014 21 December 2014 Week_51     7
## 49      KANKAN 08 December 2014 14 December 2014 Week_51    NA
## 50      KANKAN 15 December 2014 21 December 2014 Week_51     2
## 51    KEROUANE 08 December 2014 14 December 2014 Week_51    NA
## 52    KEROUANE 15 December 2014 21 December 2014 Week_51    17
## 53      KINDIA 08 December 2014 14 December 2014 Week_51    NA
## 54      KINDIA 15 December 2014 21 December 2014 Week_51    21
## 55 KISSIDOUGOU 08 December 2014 14 December 2014 Week_51    NA
## 56 KISSIDOUGOU 15 December 2014 21 December 2014 Week_51   117
## 57   KOUROUSSA 08 December 2014 14 December 2014 Week_51    NA
## 58        LOLA 08 December 2014 14 December 2014 Week_51    NA
## 59        LOLA 15 December 2014 21 December 2014 Week_51    14
## 60     MACENTA 08 December 2014 14 December 2014 Week_51    NA
## 61     MACENTA 15 December 2014 21 December 2014 Week_51    24
## 62  N'ZEREKORE 08 December 2014 14 December 2014 Week_51    NA
## 63  N'ZEREKORE 15 December 2014 21 December 2014 Week_51    18
## 64     SIGUIRI 08 December 2014 14 December 2014 Week_51    NA
## 65    TELIMELE 08 December 2014 14 December 2014 Week_51    NA
## 66    TELIMELE 15 December 2014 21 December 2014 Week_51     6
# Keep only Location as the id column, dropping the period date columns
melt(ebola_wide, measure.vars = c("Week_50", "Week_51"), 
     variable.name = "period", value.name = "cases", id.vars="Location"
     )
##       Location  period cases
## 1        BEYLA Week_50    NA
## 2      CONAKRY Week_50    37
## 3      CONAKRY Week_50    NA
## 4        COYAH Week_50    31
## 5        COYAH Week_50    NA
## 6       DABOLA Week_50     2
## 7       DABOLA Week_50    NA
## 8      DUBREKA Week_50     6
## 9      DUBREKA Week_50    NA
## 10     FARANAH Week_50     0
## 11  FORECARIAH Week_50    22
## 12  FORECARIAH Week_50    NA
## 13        FRIA Week_50    NA
## 14   GUECKEDOU Week_50     3
## 15   GUECKEDOU Week_50    NA
## 16      KANKAN Week_50     2
## 17      KANKAN Week_50    NA
## 18    KEROUANE Week_50     9
## 19    KEROUANE Week_50    NA
## 20      KINDIA Week_50     1
## 21      KINDIA Week_50    NA
## 22 KISSIDOUGOU Week_50     2
## 23 KISSIDOUGOU Week_50    NA
## 24   KOUROUSSA Week_50     3
## 25        LOLA Week_50     4
## 26        LOLA Week_50    NA
## 27     MACENTA Week_50     6
## 28     MACENTA Week_50    NA
## 29  N'ZEREKORE Week_50    13
## 30  N'ZEREKORE Week_50    NA
## 31     SIGUIRI Week_50     0
## 32    TELIMELE Week_50    10
## 33    TELIMELE Week_50    NA
## 34       BEYLA Week_51     3
## 35     CONAKRY Week_51    NA
## 36     CONAKRY Week_51    39
## 37       COYAH Week_51    NA
## 38       COYAH Week_51    35
## 39      DABOLA Week_51    NA
## 40      DABOLA Week_51     1
## 41     DUBREKA Week_51    NA
## 42     DUBREKA Week_51     5
## 43     FARANAH Week_51    NA
## 44  FORECARIAH Week_51    NA
## 45  FORECARIAH Week_51    15
## 46        FRIA Week_51     1
## 47   GUECKEDOU Week_51    NA
## 48   GUECKEDOU Week_51     7
## 49      KANKAN Week_51    NA
## 50      KANKAN Week_51     2
## 51    KEROUANE Week_51    NA
## 52    KEROUANE Week_51    17
## 53      KINDIA Week_51    NA
## 54      KINDIA Week_51    21
## 55 KISSIDOUGOU Week_51    NA
## 56 KISSIDOUGOU Week_51   117
## 57   KOUROUSSA Week_51    NA
## 58        LOLA Week_51    NA
## 59        LOLA Week_51    14
## 60     MACENTA Week_51    NA
## 61     MACENTA Week_51    24
## 62  N'ZEREKORE Week_51    NA
## 63  N'ZEREKORE Week_51    18
## 64     SIGUIRI Week_51    NA
## 65    TELIMELE Week_51    NA
## 66    TELIMELE Week_51     6
gdp_oceania <- fread("./RInputFiles/gdp_and_pop_oceania.csv")
gdp_oceania$continent <- "Oceania"


# Split the population column by year
dcast(gdp_oceania, formula = country ~ year, value.var = "population")
##                   country     1990     1995     2000     2005     2010
##  1:             Australia 17096869 18124770 19107251 20274282 22162863
##  2:                  Fiji   728626   775498   811223   821820   859952
##  3:      French Polynesia   198370   215200   237267   254884   268065
##  4:              Kiribati    72411    77727    84406    92329   102648
##  5:      Marshall Islands    47300    51020    52161    52058    52428
##  6: Micronesia, Fed. Sts.    96331   107556   107430   106198   103619
##  7:         New Caledonia   168537   189198   209997   228683   246345
##  8:           New Zealand  3397534  3674886  3858234  4134699  4369027
##  9:                 Palau    15089    17255    19174    19907    20470
## 10:      Papua New Guinea  4157903  4715929  5374051  6086905  6847517
## 11:                 Samoa   162865   170158   174614   179928   186029
## 12:       Solomon Islands   311849   359236   412336   469306   526177
## 13:                 Tonga    95152    95889    97898   100858   103947
## 14:                Tuvalu     9004     9227     9419     9694     9827
## 15:               Vanuatu   146633   168236   185058   209375   236299
# Split the gdp column by country
dcast(gdp_oceania, formula = year ~ country, value.var = "gdp")
##    year    Australia       Fiji French Polynesia Kiribati Marshall Islands
## 1: 1990 300107784341 1335705790       2777827650 48534476        101112724
## 2: 1995 338759864212 1516326476       2928452782 51419787        131689273
## 3: 2000 414780916644 1683082744       3442160840 68561481        110971769
## 4: 2005 485146627070 1896890618               NA 72512766        123281092
## 5: 2010 558301132672 1907499910               NA 73245763        127781844
##    Micronesia, Fed. Sts. New Caledonia New Zealand     Palau
## 1:             182461073    2285808773 39503489723        NA
## 2:             235499455    2573368170 45171101764 109666344
## 3:             233938394    2641677234 51606891179 119875504
## 4:             233237060            NA 62729975739 127993766
## 5:             221126797            NA 63915448868 117841911
##    Papua New Guinea     Samoa Solomon Islands     Tonga   Tuvalu   Vanuatu
## 1:       2356993537 194099795       343423086 142132664 10157211 194634278
## 2:       3559227888 204930258       504296634 171182064 11798530 228542854
## 3:       3518223421 242925503       438937728 188551996 13741847 271991177
## 4:       3812278808 313462373       457857482 213317400 13825499 283875173
## 5:       5095994376 329190409       601834699 215089852 15329960 359737826
# Reshape from long to wide format
wide <- dcast(gdp_oceania, formula = country ~ year, value.var = c("gdp", "population"))

# convert to a matrix
as.matrix(wide, rownames="country")
##                           gdp_1990     gdp_1995     gdp_2000     gdp_2005
## Australia             300107784341 338759864212 414780916644 485146627070
## Fiji                    1335705790   1516326476   1683082744   1896890618
## French Polynesia        2777827650   2928452782   3442160840           NA
## Kiribati                  48534476     51419787     68561481     72512766
## Marshall Islands         101112724    131689273    110971769    123281092
## Micronesia, Fed. Sts.    182461073    235499455    233938394    233237060
## New Caledonia           2285808773   2573368170   2641677234           NA
## New Zealand            39503489723  45171101764  51606891179  62729975739
## Palau                           NA    109666344    119875504    127993766
## Papua New Guinea        2356993537   3559227888   3518223421   3812278808
## Samoa                    194099795    204930258    242925503    313462373
## Solomon Islands          343423086    504296634    438937728    457857482
## Tonga                    142132664    171182064    188551996    213317400
## Tuvalu                    10157211     11798530     13741847     13825499
## Vanuatu                  194634278    228542854    271991177    283875173
##                           gdp_2010 population_1990 population_1995
## Australia             558301132672        17096869        18124770
## Fiji                    1907499910          728626          775498
## French Polynesia                NA          198370          215200
## Kiribati                  73245763           72411           77727
## Marshall Islands         127781844           47300           51020
## Micronesia, Fed. Sts.    221126797           96331          107556
## New Caledonia                   NA          168537          189198
## New Zealand            63915448868         3397534         3674886
## Palau                    117841911           15089           17255
## Papua New Guinea        5095994376         4157903         4715929
## Samoa                    329190409          162865          170158
## Solomon Islands          601834699          311849          359236
## Tonga                    215089852           95152           95889
## Tuvalu                    15329960            9004            9227
## Vanuatu                  359737826          146633          168236
##                       population_2000 population_2005 population_2010
## Australia                    19107251        20274282        22162863
## Fiji                           811223          821820          859952
## French Polynesia               237267          254884          268065
## Kiribati                        84406           92329          102648
## Marshall Islands                52161           52058           52428
## Micronesia, Fed. Sts.          107430          106198          103619
## New Caledonia                  209997          228683          246345
## New Zealand                   3858234         4134699         4369027
## Palau                           19174           19907           20470
## Papua New Guinea              5374051         6086905         6847517
## Samoa                          174614          179928          186029
## Solomon Islands                412336          469306          526177
## Tonga                           97898          100858          103947
## Tuvalu                           9419            9694            9827
## Vanuatu                        185058          209375          236299
# Modify your previous code
dcast(gdp_oceania, formula = continent + country ~ year, value.var = c("gdp", "population"))
##     continent               country     gdp_1990     gdp_1995     gdp_2000
##  1:   Oceania             Australia 300107784341 338759864212 414780916644
##  2:   Oceania                  Fiji   1335705790   1516326476   1683082744
##  3:   Oceania      French Polynesia   2777827650   2928452782   3442160840
##  4:   Oceania              Kiribati     48534476     51419787     68561481
##  5:   Oceania      Marshall Islands    101112724    131689273    110971769
##  6:   Oceania Micronesia, Fed. Sts.    182461073    235499455    233938394
##  7:   Oceania         New Caledonia   2285808773   2573368170   2641677234
##  8:   Oceania           New Zealand  39503489723  45171101764  51606891179
##  9:   Oceania                 Palau           NA    109666344    119875504
## 10:   Oceania      Papua New Guinea   2356993537   3559227888   3518223421
## 11:   Oceania                 Samoa    194099795    204930258    242925503
## 12:   Oceania       Solomon Islands    343423086    504296634    438937728
## 13:   Oceania                 Tonga    142132664    171182064    188551996
## 14:   Oceania                Tuvalu     10157211     11798530     13741847
## 15:   Oceania               Vanuatu    194634278    228542854    271991177
##         gdp_2005     gdp_2010 population_1990 population_1995
##  1: 485146627070 558301132672        17096869        18124770
##  2:   1896890618   1907499910          728626          775498
##  3:           NA           NA          198370          215200
##  4:     72512766     73245763           72411           77727
##  5:    123281092    127781844           47300           51020
##  6:    233237060    221126797           96331          107556
##  7:           NA           NA          168537          189198
##  8:  62729975739  63915448868         3397534         3674886
##  9:    127993766    117841911           15089           17255
## 10:   3812278808   5095994376         4157903         4715929
## 11:    313462373    329190409          162865          170158
## 12:    457857482    601834699          311849          359236
## 13:    213317400    215089852           95152           95889
## 14:     13825499     15329960            9004            9227
## 15:    283875173    359737826          146633          168236
##     population_2000 population_2005 population_2010
##  1:        19107251        20274282        22162863
##  2:          811223          821820          859952
##  3:          237267          254884          268065
##  4:           84406           92329          102648
##  5:           52161           52058           52428
##  6:          107430          106198          103619
##  7:          209997          228683          246345
##  8:         3858234         4134699         4369027
##  9:           19174           19907           20470
## 10:         5374051         6086905         6847517
## 11:          174614          179928          186029
## 12:          412336          469306          526177
## 13:           97898          100858          103947
## 14:            9419            9694            9827
## 15:          185058          209375          236299
gdp_by_industry_oceania <- fread("./RInputFiles/gdp_by_industry_oceania.tsv")


# Split gdp by industry and year
gdp_by_industry_oceania
##                   country year population    industry          gdp
##  1:             Australia 1995   18124770 agriculture 203255918527
##  2:             Australia 1995   18124770     tourism  16937993211
##  3:             Australia 2010   22162863 agriculture 251235509702
##  4:             Australia 2010   22162863     tourism  83745169901
##  5:                  Fiji 1995     775498 agriculture    909795885
##  6:                  Fiji 1995     775498     tourism     75816324
##  7:                  Fiji 2010     859952 agriculture    858374960
##  8:                  Fiji 2010     859952     tourism    286124987
##  9:      French Polynesia 1995     215200 agriculture   1757071669
## 10:      French Polynesia 1995     215200     tourism    146422639
## 11:      French Polynesia 2010     268065 agriculture         <NA>
## 12:      French Polynesia 2010     268065     tourism         <NA>
## 13:              Kiribati 1995      77727 agriculture     30851872
## 14:              Kiribati 1995      77727     tourism      2570989
## 15:              Kiribati 2010     102648 agriculture     32960593
## 16:              Kiribati 2010     102648     tourism     10986864
## 17:      Marshall Islands 1995      51020 agriculture     79013564
## 18:      Marshall Islands 1995      51020     tourism      6584464
## 19:      Marshall Islands 2010      52428 agriculture     57501830
## 20:      Marshall Islands 2010      52428     tourism     19167277
## 21: Micronesia, Fed. Sts. 1995     107556 agriculture    141299673
## 22: Micronesia, Fed. Sts. 1995     107556     tourism     11774973
## 23: Micronesia, Fed. Sts. 2010     103619 agriculture     99507059
## 24: Micronesia, Fed. Sts. 2010     103619     tourism     33169020
## 25:         New Caledonia 1995     189198 agriculture   1544020902
## 26:         New Caledonia 1995     189198     tourism    128668409
## 27:         New Caledonia 2010     246345 agriculture         <NA>
## 28:         New Caledonia 2010     246345     tourism         <NA>
## 29:           New Zealand 1995    3674886 agriculture  27102661058
## 30:           New Zealand 1995    3674886     tourism   2258555088
## 31:           New Zealand 2010    4369027 agriculture  28761951991
## 32:           New Zealand 2010    4369027     tourism   9587317330
## 33:                 Palau 1995      17255 agriculture     65799806
## 34:                 Palau 1995      17255     tourism      5483317
## 35:                 Palau 2010      20470 agriculture     53028860
## 36:                 Palau 2010      20470     tourism     17676287
## 37:      Papua New Guinea 1995    4715929 agriculture   2135536733
## 38:      Papua New Guinea 1995    4715929     tourism    177961394
## 39:      Papua New Guinea 2010    6847517 agriculture   2293197469
## 40:      Papua New Guinea 2010    6847517     tourism    764399156
## 41:                 Samoa 1995     170158 agriculture    122958155
## 42:                 Samoa 1995     170158     tourism     10246513
## 43:                 Samoa 2010     186029 agriculture    148135684
## 44:                 Samoa 2010     186029     tourism     49378561
## 45:       Solomon Islands 1995     359236 agriculture    302577980
## 46:       Solomon Islands 1995     359236     tourism     25214832
## 47:       Solomon Islands 2010     526177 agriculture    270825615
## 48:       Solomon Islands 2010     526177     tourism     90275205
## 49:                 Tonga 1995      95889 agriculture    102709238
## 50:                 Tonga 1995      95889     tourism      8559103
## 51:                 Tonga 2010     103947 agriculture     96790433
## 52:                 Tonga 2010     103947     tourism     32263478
## 53:                Tuvalu 1995       9227 agriculture      7079118
## 54:                Tuvalu 1995       9227     tourism       589927
## 55:                Tuvalu 2010       9827 agriculture      6898482
## 56:                Tuvalu 2010       9827     tourism      2299494
## 57:               Vanuatu 1995     168236 agriculture    137125712
## 58:               Vanuatu 1995     168236     tourism     11427143
## 59:               Vanuatu 2010     236299 agriculture    161882022
## 60:               Vanuatu 2010     236299     tourism     53960674
##                   country year population    industry          gdp
dcast(gdp_by_industry_oceania, formula = country ~ industry + year, value.var=c("gdp"))
##                   country agriculture_1995 agriculture_2010 tourism_1995
##  1:             Australia     203255918527     251235509702  16937993211
##  2:                  Fiji        909795885        858374960     75816324
##  3:      French Polynesia       1757071669             <NA>    146422639
##  4:              Kiribati         30851872         32960593      2570989
##  5:      Marshall Islands         79013564         57501830      6584464
##  6: Micronesia, Fed. Sts.        141299673         99507059     11774973
##  7:         New Caledonia       1544020902             <NA>    128668409
##  8:           New Zealand      27102661058      28761951991   2258555088
##  9:                 Palau         65799806         53028860      5483317
## 10:      Papua New Guinea       2135536733       2293197469    177961394
## 11:                 Samoa        122958155        148135684     10246513
## 12:       Solomon Islands        302577980        270825615     25214832
## 13:                 Tonga        102709238         96790433      8559103
## 14:                Tuvalu          7079118          6898482       589927
## 15:               Vanuatu        137125712        161882022     11427143
##     tourism_2010
##  1:  83745169901
##  2:    286124987
##  3:         <NA>
##  4:     10986864
##  5:     19167277
##  6:     33169020
##  7:         <NA>
##  8:   9587317330
##  9:     17676287
## 10:    764399156
## 11:     49378561
## 12:     90275205
## 13:     32263478
## 14:      2299494
## 15:     53960674

Fraud Detection in R

Chapter 1 - Introduction and Motivation

Introduction and Motivation:

  • Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms
  • Fraud is rare, but the cost of not detecting fraud is very large
  • Fraud detection models need to have multiple characteristics
    • Statistical accuracy
    • Interpretability
    • Regulatory compliance
    • Economic impact - total cost of ownership and ROI
    • Complement expert-based findings
  • Multiple challenges with fraud detection
    • Imbalanced samples - credit card fraud is typically less than 0.5% (“needle in a haystack” problem)
    • Operational efficiency (e.g., credit card decisions need to be made in ~8 seconds)
    • Avoid harassing good customers over legitimate transactions
  • Confusion matrices are commonly used - true positive, true negative, false positive, false negative
    • caret::confusionMatrix(data = predictions, reference = fraud_label)
    • Accuracy is not a good measure when there is an imbalanced sample
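As a quick base-R illustration of this last point (the class counts below are made up for the example), a classifier that never flags fraud still posts near-perfect accuracy:

```r
# Hypothetical labels: 98 legitimate transactions, 2 fraudulent ones
fraud_label <- c(rep(0, 98), rep(1, 2))

# A useless "classifier" that flags nothing as fraud
predictions <- rep(0, 100)

# Accuracy looks excellent even though zero fraud cases were caught
accuracy <- mean(predictions == fraud_label)
accuracy  # 0.98
```

This is why balanced accuracy, sensitivity/specificity, or cost-based measures are preferred for imbalanced problems.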

Time features:

  • Time is an important aspect of fraud detection; transactions tend to occur at similar hours
    • There is no natural ordering to time though, since time is circular; the arithmetic mean does not account for the periodicity
  • Can create a circular histogram from the timestamp data
    • ts <- as.numeric(lubridate::hms(timestamps)) / 3600
    • clock <- ggplot(data.frame(ts), aes(x = ts)) + geom_histogram(breaks = seq(0, 24), colour = "blue", fill = "lightblue") + coord_polar()
    • arithmetic_mean <- mean(ts)
    • clock + geom_vline(xintercept = arithmetic_mean, linetype = 2, color = "red", size = 2)
  • Can instead use the von Mises distribution for modeling time (variable wrapped around a circle)
    • mu - periodic mean
    • kappa - 1/k is the periodic variance (k is a measure of concentration)
    • library(circular)
    • ts <- circular(ts, units = "hours", template = "clock24")
    • estimates <- mle.vonmises(ts)
    • p_mean <- estimates$mu %% 24
    • concentration <- estimates$kappa
  • Can then derive a binary feature indicating whether a given timestamp falls inside the confidence interval
    • densities <- dvonmises(ts, mu = p_mean, kappa = concentration)
    • alpha <- 0.90
    • quantile <- qvonmises((1 - alpha)/2, mu = p_mean, kappa = concentration) %% 24
    • cutoff <- dvonmises(quantile, mu = p_mean, kappa = concentration)
    • time_feature <- densities >= cutoff
  • Can also run confidence intervals with rolling windows
    • time_feature <- c(NA, NA)
    • for (i in 3:length(ts)) {
    • ts_history <- ts[1:(i-1)]
    • estimates <- mle.vonmises(ts_history)
    • p_mean <- estimates$mu %% 24
    • concentration <- estimates$kappa
    • dens_i <- dvonmises(ts[i], mu = p_mean, kappa = concentration)
    • alpha <- 0.90
    • quantile <- qvonmises((1 - alpha)/2, mu = p_mean, kappa = concentration) %% 24
    • cutoff <- dvonmises(quantile, mu = p_mean, kappa = concentration)
    • time_feature[i] <- dens_i >= cutoff
    • }
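The periodicity problem noted above can be seen with two made-up timestamps straddling midnight; this minimal base-R sketch computes a circular mean by averaging unit vectors on the clock face (the circular package handles this properly via mle.vonmises):

```r
# Two timestamps straddling midnight: 23:00 and 01:00
hours <- c(23, 1)

# The arithmetic mean lands at noon -- the opposite side of the clock
mean(hours)  # 12

# Circular mean: map hours to angles, average the unit vectors,
# convert the resulting angle back to hours on [0, 24)
theta <- hours / 24 * 2 * pi
circ_mean <- (atan2(mean(sin(theta)), mean(cos(theta))) / (2 * pi) * 24) %% 24
circ_mean  # approximately 0 (midnight), up to floating point
```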

Frequency features:

  • May need to engineer additional features to build a well-performing fraud detection algorithm
    • trans %>% select(fraud_flag, orig_account_id, benef_country, authentication_cd, channel_cd, amount)
  • For example, may want to look at the authentication methods by person and associations to fraud
    • trans <- trans %>% arrange(timestamp)
    • trans_Alice <- trans %>% filter(account_name == "Alice")
    • frequency_fun <- function(steps, auth_method) {
    • n <- length(steps)
    • frequency <- sum(auth_method[1:n] == auth_method[n + 1])
    • return(frequency)
    • }
    • freq_auth <- zoo::rollapply(trans_Alice$transfer_id, width = list(-1:-length(trans_Alice$transfer_id)), partial = TRUE, FUN = frequency_fun, trans_Alice$authentication_cd)
    • freq_auth <- c(0, freq_auth)
  • Can run a similar process for multiple accounts
    • trans %>% group_by(account_name)
    • trans <- trans %>% group_by(account_name) %>% mutate(freq_auth = c(0, zoo::rollapplyr(transfer_id, width = list(-1:-length(transfer_id)), partial = TRUE, FUN = frequency_fun, authentication_cd)))
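The counting logic behind this frequency feature can be sketched without zoo, using a made-up vector of authentication codes:

```r
# Made-up authentication codes for five successive transactions
auth <- c("AU02", "AU02", "AU05", "AU02", "AU05")

# For each transaction, count prior transactions using the same method
freq_auth <- sapply(seq_along(auth), function(i) {
    sum(auth[seq_len(i - 1)] == auth[i])
})
freq_auth  # 0 1 0 2 1
```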

Recency features:

  • Recency features capture the dimension of time in a manner designed to highlight potential fraud
    • Example of an authentication method that has not recently been used
    • Recency near 0 means the value has not been used recently (or never before), while recency near 1 means it was used very recently
    • recency <- exp(-gamma * time) # gamma is a tuning parameter, typically a small number
    • Gamma is often chosen such that recency will be 0.01 after 180 days
  • Can create the recency feature in R
    • recency_fun <- function(t, gamma, auth_cd, freq_auth) {
    • n_t <- length(t)
    • if (freq_auth[n_t] == 0) { recency <- 0 } else {
    •   time_diff <- t[1] - max(t[2:n_t][auth_cd[(n_t-1):1] == auth_cd[n_t]])
    •   recency <- exp(-gamma * time_diff)  
    • }
    • return(recency)
    • }
  • Can then calculate gamma and apply to each individual account
    • gamma <- -log(0.01)/180 # = 0.0256
    • trans <- trans %>%
    • group_by(account_name) %>%
    • mutate(rec_auth = zoo::rollapply(timestamp, width = list(0:-length(transfer_id)), partial = TRUE, FUN = recency_fun, gamma, authentication_cd, freq_auth))
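As a numeric check of the gamma calibration above (base R only; the day offsets are arbitrary):

```r
# Calibrate gamma so that recency decays to 0.01 after 180 days
gamma <- -log(0.01) / 180   # about 0.0256

recency <- function(days) exp(-gamma * days)

# Decay at 0, 90, and 180 days since last use of the value
round(recency(c(0, 90, 180)), 4)  # 1.00 0.10 0.01
```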

Example code includes:

load("./RInputFiles/transfers02_v2.RData")  # data.frame transfers is 628x12

# Print the first 6 rows of the dataset
head(transfers)
##     fraud_flag transfer_id timestamp orig_account_id benef_account_id
## 226          0   xtr215694  21966103       X27769025        X86111129
## 141          0   xtr671675  40885557       X15452684        X63932196
## 493          0   xtr977348  19945191       X96278924        X56011266
## 240          0   xtr655123  27404301       X27769025        X95653232
## 445          0   xtr785302   6566236       X96278924        X85352318
## 88           0   xtr187306  17576922       X15452684        X18544316
##     benef_country channel_cd authentication_cd communication_cd
## 226         ISO03       CH01              AU02            COM02
## 141         ISO03       CH03              AU02            COM02
## 493         ISO03       CH04              AU05            COM02
## 240         ISO03       CH01              AU04            COM02
## 445         ISO04       CH07              AU04            COM01
## 88          ISO03       CH03              AU02            COM01
##     empty_communication_flag orig_balance_before amount
## 226                        0                5412     33
## 141                        0                7268     40
## 493                        0                1971    227
## 240                        0               10603     20
## 445                        0                6228   5176
## 88                         0                4933     54
# Display the structure of the dataset
str(transfers)
## 'data.frame':    628 obs. of  12 variables:
##  $ fraud_flag              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ transfer_id             : Factor w/ 785 levels "xtr402538","xtr517350",..: 226 141 493 240 445 88 714 475 97 132 ...
##  $ timestamp               : num  21966103 40885557 19945191 27404301 6566236 ...
##  $ orig_account_id         : Factor w/ 9 levels "X51735094","X15452684",..: 3 2 6 3 6 2 8 6 2 2 ...
##  $ benef_account_id        : Factor w/ 301 levels "X71039384","X82542502",..: 78 31 199 87 180 11 276 179 28 11 ...
##  $ benef_country           : Factor w/ 4 levels "ISO03","ISO02",..: 1 1 1 1 4 1 1 1 1 1 ...
##  $ channel_cd              : Factor w/ 7 levels "CH07","CH05",..: 7 3 6 7 1 3 7 6 3 3 ...
##  $ authentication_cd       : Factor w/ 5 levels "AU02","AU04",..: 1 1 3 2 2 1 2 3 1 1 ...
##  $ communication_cd        : Factor w/ 3 levels "COM02","COM03",..: 1 1 1 1 3 3 3 3 1 1 ...
##  $ empty_communication_flag: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ orig_balance_before     : num  5412 7268 1971 10603 6228 ...
##  $ amount                  : num  33 40 227 20 5176 ...
# Determine fraction of legitimate and fraudulent cases
class_distribution <- prop.table(table(transfers$fraud_flag))
print(class_distribution)
## 
##          0          1 
## 0.97770701 0.02229299
# Make pie chart of column fraud_flag
df <- data.frame(class = c("no fraud", "fraud"), pct = as.numeric(class_distribution)) %>%
    mutate(class = factor(class, levels = c("no fraud", "fraud")),
           cumulative = cumsum(pct), midpoint = cumulative - pct / 2,
           label = paste0(class, " ", round(pct*100, 2), "%")
           )

ggplot(df, aes(x = 1, weight = pct, fill = class)) +
    scale_fill_manual(values = c("dodgerblue", "red")) +
    geom_bar(width = 1, position = "stack") +
    coord_polar(theta = "y") +
    geom_text(aes(x = 1.3, y = midpoint, label = label)) +
    ggmap::theme_nothing()

# Create vector predictions containing 0 for every transfer
predictions <- factor(rep(0, nrow(transfers)), levels = c(0, 1))

# Compute confusion matrix
caret::confusionMatrix(data = predictions, reference = transfers$fraud_flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 614  14
##          1   0   0
##                                           
##                Accuracy : 0.9777          
##                  95% CI : (0.9629, 0.9878)
##     No Information Rate : 0.9777          
##     P-Value [Acc > NIR] : 0.570441        
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.000512        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9777          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9777          
##          Detection Rate : 0.9777          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
# Compute cost of not detecting fraud
cost <- sum(transfers$amount[transfers$fraud_flag == 1])
print(cost)
## [1] 64410
# load("./RInputFiles/timestamps_circular.RData")  # 'circular' num[1:25] ts
load("./RInputFiles/timestamps_digital.RData")  # chr[1:25] timestamps

# Convert the plain text to hours
ts <- as.numeric(lubridate::hms(timestamps)) / 3600

# Convert the data to class circular
ts <- circular::circular(ts, units = "hours", template = "clock24")

# Plot a circular histogram
clock <- ggplot(data.frame(ts), aes(x = ts)) +
    geom_histogram(breaks = seq(0, 24), colour = "blue", fill = "lightblue") +
    coord_polar() + 
    scale_x_continuous("", limits = c(0, 24), breaks = seq(0, 24))
plot(clock)

# Create the von Mises distribution estimates
estimates <- circular::mle.vonmises(ts)

# Extract the periodic mean from the estimates
p_mean <- estimates$mu %% 24

# Add the periodic mean to the circular histogram
clock <- ggplot(data.frame(ts), aes(x = ts)) +
    geom_histogram(breaks = seq(0, 24), colour = "blue", fill = "lightblue") +
    coord_polar() + 
    scale_x_continuous("", limits = c(0, 24), breaks = seq(0, 24)) +
    geom_vline(xintercept = as.numeric(p_mean), color = "red", linetype = 2, size = 1.5)
plot(clock)

# Set the confidence level and extract the periodic mean and concentration from the estimates
alpha <- 0.95
p_mean <- estimates$mu %% 24
concentration <- estimates$kappa

# Estimate densities of all 25 timestamps
densities <- circular::dvonmises(ts, mu = p_mean, kappa = concentration)

# Compute the density cutoff corresponding to the 95%-CI
cutoff <- circular::dvonmises(circular::qvonmises((1 - alpha)/2, mu = p_mean, kappa = concentration), 
                              mu = p_mean, kappa = concentration
                              )

# Define the variable time_feature
time_feature <- densities >= cutoff
print(cbind.data.frame(ts, time_feature))
##           ts time_feature
## 1   8.730000         TRUE
## 2   9.297778         TRUE
## 3  12.939444         TRUE
## 4  12.458889         TRUE
## 5  10.989722         TRUE
## 6   7.379167         TRUE
## 7  11.233056         TRUE
## 8  10.223889         TRUE
## 9  10.116944         TRUE
## 10  6.165556         TRUE
## 11 12.721389         TRUE
## 12  7.126389         TRUE
## 13  9.612222         TRUE
## 14 10.750000         TRUE
## 15  8.460000         TRUE
## 16  7.926389         TRUE
## 17 11.548889         TRUE
## 18 13.309722         TRUE
## 19 11.164167         TRUE
## 20  9.775833         TRUE
## 21  6.986667         TRUE
## 22 10.326667         TRUE
## 23  9.663056         TRUE
## 24  9.662778         TRUE
## 25 18.398333        FALSE
load("./RInputFiles/transfers_Bob.RData")  # data.frame trans_Bob 17x12

# Frequency feature based on channel_cd
frequency_fun <- function(steps, channel) {
    n <- length(steps)
    frequency <- sum(channel[1:n] == channel[n+1])
    return(frequency)
}

# Create freq_channel feature
freq_channel <- zoo::rollapply(trans_Bob$transfer_id, width = list(-1:-length(trans_Bob$transfer_id)),
                               partial = TRUE, FUN = frequency_fun, trans_Bob$channel_cd
                               )

# Print the features channel_cd, freq_channel and fraud_flag next to each other
freq_channel <- c(0, freq_channel)
cbind.data.frame(trans_Bob$channel_cd, freq_channel, trans_Bob$fraud_flag)
##    trans_Bob$channel_cd freq_channel trans_Bob$fraud_flag
## 1                  CH07            0                    0
## 2                  CH07            1                    0
## 3                  CH06            0                    0
## 4                  CH06            1                    0
## 5                  CH07            2                    0
## 6                  CH02            0                    0
## 7                  CH06            2                    0
## 8                  CH07            3                    0
## 9                  CH07            4                    0
## 10                 CH07            5                    0
## 11                 CH02            1                    0
## 12                 CH07            6                    0
## 13                 CH06            3                    0
## 14                 CH02            2                    0
## 15                 CH07            7                    0
## 16                 CH06            4                    0
## 17                 CH05            0                    1
# load("./RInputFiles/transfers_AliceBob.RData")  # data.frame trans 40x12
load("./RInputFiles/transfers_AliceBob_freq.RData")  # data.frame trans 40x14

# Group the data
trans <- trans %>% group_by(account_name) %>%
    # Mutate the data to add a new feature
    mutate(freq_channel = c(0, zoo::rollapply(transfer_id, width = list(-1:-length(transfer_id)),
                                              partial = TRUE, FUN = frequency_fun, channel_cd
                                              )
                            )
           )

# Print the features as columns next to each other
as.data.frame(trans %>% select(account_name, channel_cd, freq_channel, fraud_flag))
##    account_name channel_cd freq_channel fraud_flag
## 1           Bob       CH07            0          0
## 2         Alice       CH04            0          0
## 3           Bob       CH07            1          0
## 4           Bob       CH06            0          0
## 5         Alice       CH07            0          0
## 6           Bob       CH06            1          0
## 7         Alice       CH04            1          0
## 8           Bob       CH07            2          0
## 9         Alice       CH01            0          0
## 10          Bob       CH02            0          0
## 11          Bob       CH06            2          0
## 12        Alice       CH03            0          0
## 13        Alice       CH04            2          0
## 14        Alice       CH04            3          0
## 15        Alice       CH04            4          0
## 16          Bob       CH07            3          0
## 17          Bob       CH07            4          0
## 18          Bob       CH07            5          0
## 19        Alice       CH01            1          0
## 20        Alice       CH04            5          0
## 21          Bob       CH02            1          0
## 22        Alice       CH04            6          0
## 23        Alice       CH04            7          0
## 24          Bob       CH07            6          0
## 25        Alice       CH04            8          0
## 26          Bob       CH06            3          0
## 27        Alice       CH04            9          0
## 28        Alice       CH04           10          0
## 29        Alice       CH04           11          0
## 30          Bob       CH02            2          0
## 31        Alice       CH04           12          0
## 32        Alice       CH01            2          0
## 33        Alice       CH01            3          0
## 34          Bob       CH07            7          0
## 35        Alice       CH04           13          0
## 36        Alice       CH04           14          0
## 37          Bob       CH06            4          0
## 38        Alice       CH04           15          0
## 39          Bob       CH05            0          1
## 40        Alice       CH05            0          1
# Create the recency function
recency_fun <- function(t, gamma, channel_cd, freq_channel) {
    n_t <- length(t)
    # If the channel has never been used, return 0 else, return the exponent
    if (freq_channel[n_t] == 0) { 
        return(0) 
    } else {
        time_diff <- t[1] - max(t[2:n_t][channel_cd[(n_t-1):1] == channel_cd[n_t]])
        exponent <- -gamma * time_diff
        return(exp(exponent))
    }
}

# Group, mutate and rollapply
gamma <- -log(0.01)/90
trans <- trans %>% 
    group_by(account_name) %>%
    mutate(rec_channel = zoo::rollapply(timestamp, width = list(0:-length(transfer_id)), 
                                        partial = TRUE, FUN = recency_fun, 
                                        gamma, channel_cd, freq_channel
                                        )
           )

# Print a new dataframe
as.data.frame(trans %>% 
                  select(account_name, channel_cd, timestamp, rec_channel, fraud_flag)
              )
##    account_name channel_cd timestamp  rec_channel fraud_flag
## 1           Bob       CH07   3823030 0.0000000000          0
## 2         Alice       CH04   4675604 0.0000000000          0
## 3           Bob       CH07   4963503 0.0000000000          0
## 4           Bob       CH06   5554880 0.0000000000          0
## 5         Alice       CH07   5554975 0.0000000000          0
## 6           Bob       CH06   5555011 0.0012271252          0
## 7         Alice       CH04   6069408 0.0000000000          0
## 8           Bob       CH07   6069430 0.0000000000          0
## 9         Alice       CH01   6400316 0.0000000000          0
## 10          Bob       CH02   6400428 0.0000000000          0
## 11          Bob       CH06   6400499 0.0000000000          0
## 12        Alice       CH03   7251861 0.0000000000          0
## 13        Alice       CH04   8312166 0.0000000000          0
## 14        Alice       CH04   8312333 0.0001944862          0
## 15        Alice       CH04   8488444 0.0000000000          0
## 16          Bob       CH07   9440864 0.0000000000          0
## 17          Bob       CH07  10704152 0.0000000000          0
## 18          Bob       CH07  13473757 0.0000000000          0
## 19        Alice       CH01  14775404 0.0000000000          0
## 20        Alice       CH04  19606163 0.0000000000          0
## 21          Bob       CH02  24787430 0.0000000000          0
## 22        Alice       CH04  24787532 0.0000000000          0
## 23        Alice       CH04  24787661 0.0013593564          0
## 24          Bob       CH07  24989386 0.0000000000          0
## 25        Alice       CH04  24989464 0.0000000000          0
## 26          Bob       CH06  27471273 0.0000000000          0
## 27        Alice       CH04  29049288 0.0000000000          0
## 28        Alice       CH04  30230242 0.0000000000          0
## 29        Alice       CH04  30230331 0.0105250029          0
## 30          Bob       CH02  32397147 0.0000000000          0
## 31        Alice       CH04  32397147 0.0000000000          0
## 32        Alice       CH01  32397148 0.0000000000          0
## 33        Alice       CH01  32737015 0.0000000000          0
## 34          Bob       CH07  35169990 0.0000000000          0
## 35        Alice       CH04  35178888 0.0000000000          0
## 36        Alice       CH04  35178952 0.0378248991          0
## 37          Bob       CH06  35179134 0.0000000000          0
## 38        Alice       CH04  36302300 0.0000000000          0
## 39          Bob       CH05  38132166 0.0000000000          1
## 40        Alice       CH05  38296341 0.0000000000          1
load("./RInputFiles/transfers_chap1_L4.RData")  # data.frame transfers 222x16

# Statistics of frequency & recency features of legitimate transactions:
summary(transfers %>% 
            filter(fraud_flag==0) %>% 
            select(freq_channel, freq_auth, rec_channel, rec_auth)
        )
##   freq_channel      freq_auth      rec_channel        rec_auth     
##  Min.   :  0.00   Min.   : 0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  6.25   1st Qu.: 4.00   1st Qu.:0.7083   1st Qu.:0.5506  
##  Median : 28.50   Median :12.00   Median :0.8894   Median :0.8462  
##  Mean   : 45.52   Mean   :20.29   Mean   :0.7766   Mean   :0.7176  
##  3rd Qu.: 82.75   3rd Qu.:26.75   3rd Qu.:0.9999   3rd Qu.:0.9999  
##  Max.   :137.00   Max.   :81.00   Max.   :1.0000   Max.   :1.0000
# Statistics of frequency & recency features of fraudulent transactions:
summary(transfers %>% 
            filter(fraud_flag==1) %>% 
            select(freq_channel, freq_auth, rec_channel, rec_auth)
        )
##   freq_channel    freq_auth      rec_channel         rec_auth      
##  Min.   : 0.0   Min.   : 1.00   Min.   :0.00000   Min.   :0.01699  
##  1st Qu.: 0.0   1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:0.02704  
##  Median : 2.0   Median : 6.00   Median :0.02124   Median :0.03644  
##  Mean   : 3.5   Mean   : 9.75   Mean   :0.03705   Mean   :0.24854  
##  3rd Qu.: 5.5   3rd Qu.:11.75   3rd Qu.:0.05830   3rd Qu.:0.25794  
##  Max.   :10.0   Max.   :26.00   Max.   :0.10573   Max.   :0.90430

Chapter 2 - Social Network Analytics

Social network analytics:

  • Social networks include both nodes and edges
  • Edges could include money transfers, and may be weighted based on frequency, amount, intensity, etc.
    • Negative weights are rare, but can be used to show animosity
    • Incoming and outgoing edges can be reflected using a directed network
  • Connectivity matrices can be valuable; for example, square matrix of 1s and 0s
  • Adjacency lists can also be used as a list of form (Node1, Node2, weight)
  • Can move from a transactional database to a network
    • network <- igraph::graph_from_data_frame(transactions, directed = FALSE)
    • plot(network)
    • E(network)
    • V(network)
    • V(network)$name
    • plot(net)
    • E(net)$width <- count.multiple(net)
    • edge_attr(net)
    • E(net)$curved <- FALSE
    • plot(net)
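The connectivity matrix mentioned above can be pulled straight from an igraph object; a minimal sketch, using a small made-up edge list (the node names are hypothetical):

```r
library(igraph)

# Made-up adjacency list of the form (Node1, Node2)
edges <- data.frame(from = c("A", "A", "B"), to = c("B", "C", "C"))
net <- graph_from_data_frame(edges, directed = FALSE)

# Connectivity matrix: a square matrix of 1s and 0s
as_adjacency_matrix(net, sparse = FALSE)
```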

Fraud and social network analytics:

  • Social network analytics can help improve fraud detection
  • Fraudsters tend to cluster together; same activities, resources, victims, etc.
  • Homophily is the concept that fraudsters are more likely to be connected to fraudsters
  • Social networks can be helpful for finding identity theft
    • Legitimate customers tend to call their frequent contacts
    • Fraudsters tend to call different people
  • Money mules transfer money illegally; they can potentially be identified by examining nodes in the network
    • Assign colors to nodes by status, then add that information to the network

Social network based inference:

  • Can attempt to predict the behavior of a node based on the other nodes in the network
    • Non-relational models - only use local information (logit, decision trees)
    • Relational models - network links included (relational neighbor classifiers)
  • Assuming that there is homophily, then relational neighbor classifiers may be appropriate
    • Can be as simple as proportion of known links belonging to each class (may be weighted by the strength of the edge)
    • subnetwork <- subgraph(network, v = c("?", "B", "D"))
    • prob_fraud <- strength(subnetwork, v = "?") / strength(network, v = "?")
    • prob_fraud

Social network metrics:

  • The geodesic is the shortest path between two nodes (may include weights in the calculation)
  • Generally, the closer to a fraudulent node, the greater the potential impact
  • The maximum degree possible for a network with N nodes is N-1 (normalizing means dividing the degree by N-1)
  • Closeness is the inverse of the sum of the distance of a node to all other nodes in the network
    • Closeness is at most 1 / (N-1), so it is normalized by multiplying by (N-1)
  • Betweenness is the number of times that a node is in the geodesic for two other nodes
    • betweenness(network)
    • betweenness(network, normalized = TRUE)

Example code includes:

load("./RInputFiles/network_data.RData")  # data.frame transfers 60x6 and data.frame account_info 38x2


library(igraph)

# Have a look at the data
head(transfers)
##   originator beneficiary     amount  time benef_country payment_channel
## 1        I47         I87 1463.72327 15:12           CAN         CHAN_01
## 2        I40         I61  143.26357 15:40           GBR         CHAN_01
## 3        I89         I61   53.32169 11:44           GBR         CHAN_05
## 4        I24         I52  226.27792 14:55           GBR         CHAN_03
## 5        I40         I87 1151.49827 21:20           CAN         CHAN_03
## 6        I63         I54  110.43880 20:21           GBR         CHAN_03
# Create an undirected network from the dataset
net <- graph_from_data_frame(transfers, directed = FALSE)

# Plot the network with the vertex labels in bold and black
plot(net, vertex.label.color = "black", vertex.label.font = 2)

load("./RInputFiles/network_data_simple.RData")  # data.frame edges 16x2

# Create a network from the data frame
net <- graph_from_data_frame(edges, directed = FALSE)

# Plot the network with the multiple edges
plot(net, layout = layout_in_circle)

# Specify new edge attributes width and curved
E(net)$width <- count.multiple(net)
E(net)$curved <- FALSE

# Check the new edge attributes and plot the network with overlapping edges
edge_attr(net)
## $width
##  [1] 7 7 7 7 7 7 7 1 1 1 4 4 4 4 1 1
## 
## $curved
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
plot(net, layout = layout_in_circle)

# Create an undirected network from the dataset
net <- graph_from_data_frame(transfers, directed = FALSE)

# Add account_type as an attribute to the nodes of the network
V(net)$account_type <- account_info$type

# Have a look at the vertex attributes
print(vertex_attr(net))
## $name
##  [1] "I47" "I40" "I89" "I24" "I63" "I28" "I44" "I23" "I41" "I93" "I52"
## [12] "I25" "I69" "I15" "I21" "I77" "I76" "I17" "I81" "I37" "I11" "I87"
## [23] "I61" "I54" "I80" "I20" "I64" "I46" "I19" "I55" "I14" "I30" "I29"
## [34] "I35" "I27" "I60" "I22" "I66"
## 
## $account_type
##  [1] 3 3 1 2 2 1 1 1 3 2 2 1 2 1 2 1 2 1 1 3 3 3 1 1 1 3 2 2 2 1 1 2 2 2 2
## [36] 1 1 2
# Check for homophily based on account_type
assortativity_nominal(net, types = V(net)$account_type, directed = FALSE)
## [1] 0.1810621
# Each account type is assigned a color
vertex_colors <- c("grey", "lightblue", "darkorange")

# Add attribute color to V(net) which holds the color of each node depending on its account_type
V(net)$color <- vertex_colors[V(net)$account_type]

# Plot the network
plot(net)

load("./RInputFiles/network_data_v2.RData")  # data.frame transfers 60x6 and data.frame account_info 38x3

# From data frame to graph
net <- graph_from_data_frame(transfers, directed = FALSE)

# Plot the network; color nodes according to isMoneyMule-variable
V(net)$color <- ifelse(account_info$isMoneyMule, "darkorange", "slateblue1")
plot(net, vertex.label.color = "black", vertex.label.font = 2, vertex.size = 18)

# Find the id of the money mule accounts
print(account_info$id[account_info$isMoneyMule])
## [1] I47 I41 I87 I20
## 38 Levels: I11 I14 I15 I17 I19 I20 I21 I22 I23 I24 I25 I27 I28 I29 ... I93
# Create subgraph containing node "I41" and all money mules nodes
subnet <- induced_subgraph(net, v = c("I41", "I47", "I87", "I20"))

# Compute the money mule probability of node "I41" based on the neighbors
strength(subnet, v="I41") / strength(net, v="I41")
## I41 
## 0.6
load("./RInputFiles/kite.RData")  # list[1:10] kite

# Plot network kite
plot(kite)

# Find the degree of each node
degree(kite)
##  [1] 4 4 3 6 3 5 5 3 2 1
# Which node has the largest degree?
which.max(degree(kite))
## [1] 4
# Plot kite with vertex.size proportional to the degree of each node
plot(kite, vertex.size = 6 * degree(kite))

# Find the closeness of each node
closeness(kite)
##  [1] 0.05882353 0.05882353 0.05555556 0.06666667 0.05555556 0.07142857
##  [7] 0.07142857 0.06666667 0.04761905 0.03448276
# Which node has the largest closeness?
which.max(closeness(kite))
## [1] 6
# Plot kite with vertex.size proportional to the closeness of each node
plot(kite, vertex.size = 500 * closeness(kite))

# Find the betweenness of each node
betweenness(kite)
##  [1]  0.8333333  0.8333333  0.0000000  3.6666667  0.0000000  8.3333333
##  [7]  8.3333333 14.0000000  8.0000000  0.0000000
# Which node has the largest betweenness?
which.max(betweenness(kite))
## [1] 8
# Plot kite with vertex.size proportional to the betweenness of each node
plot(kite, vertex.size = 5 * betweenness(kite))

# Plot network and print account info
plot(net)
legend("bottomleft", legend = c("known money mule", "legit account"), 
       fill = c("darkorange", "lightblue"), bty = "n"
       )

print(account_info)
##     id isMoneyMule type
## 1  I47        TRUE    3
## 2  I40       FALSE    3
## 3  I89       FALSE    1
## 4  I24       FALSE    2
## 5  I63       FALSE    2
## 6  I28       FALSE    1
## 7  I44       FALSE    1
## 8  I23       FALSE    1
## 9  I41        TRUE    3
## 10 I93       FALSE    2
## 11 I52       FALSE    2
## 12 I25       FALSE    1
## 13 I69       FALSE    2
## 14 I15       FALSE    1
## 15 I21       FALSE    2
## 16 I77       FALSE    1
## 17 I76       FALSE    2
## 18 I17       FALSE    1
## 19 I81       FALSE    1
## 20 I37       FALSE    3
## 21 I11       FALSE    3
## 22 I87        TRUE    3
## 23 I61       FALSE    1
## 24 I54       FALSE    1
## 25 I80       FALSE    1
## 26 I20        TRUE    3
## 27 I64       FALSE    2
## 28 I46       FALSE    2
## 29 I19       FALSE    2
## 30 I55       FALSE    1
## 31 I14       FALSE    1
## 32 I30       FALSE    2
## 33 I29       FALSE    2
## 34 I35       FALSE    2
## 35 I27       FALSE    2
## 36 I60       FALSE    1
## 37 I22       FALSE    1
## 38 I66       FALSE    2
# Degree
account_info$degree <- degree(net, normalized = TRUE)

# Closeness
account_info$closeness <- closeness(net, normalized = TRUE)

# Betweenness
account_info$betweenness <- betweenness(net, normalized = TRUE)

print(account_info)
##     id isMoneyMule type     degree closeness  betweenness
## 1  I47        TRUE    3 0.08108108 0.3775510 0.0000000000
## 2  I40       FALSE    3 0.16216216 0.3425926 0.1979479479
## 3  I89       FALSE    1 0.05405405 0.2587413 0.0000000000
## 4  I24       FALSE    2 0.18918919 0.3737374 0.2600100100
## 5  I63       FALSE    2 0.13513514 0.2983871 0.0350350350
## 6  I28       FALSE    1 0.21621622 0.3627451 0.2942942943
## 7  I44       FALSE    1 0.18918919 0.3663366 0.2209709710
## 8  I23       FALSE    1 0.21621622 0.3592233 0.2757757758
## 9  I41        TRUE    3 0.13513514 0.4352941 0.2817817818
## 10 I93       FALSE    2 0.08108108 0.3274336 0.0838338338
## 11 I52       FALSE    2 0.05405405 0.2846154 0.0000000000
## 12 I25       FALSE    1 0.08108108 0.3008130 0.0347847848
## 13 I69       FALSE    2 0.08108108 0.2781955 0.0000000000
## 14 I15       FALSE    1 0.08108108 0.2720588 0.0007507508
## 15 I21       FALSE    2 0.05405405 0.2700730 0.0000000000
## 16 I77       FALSE    1 0.08108108 0.2700730 0.0000000000
## 17 I76       FALSE    2 0.08108108 0.2700730 0.0000000000
## 18 I17       FALSE    1 0.05405405 0.3032787 0.0315315315
## 19 I81       FALSE    1 0.08108108 0.2720588 0.0007507508
## 20 I37       FALSE    3 0.05405405 0.3217391 0.0155155155
## 21 I11       FALSE    3 0.08108108 0.3663366 0.0950950951
## 22 I87        TRUE    3 0.16216216 0.4352941 0.3753753754
## 23 I61       FALSE    1 0.05405405 0.2587413 0.0000000000
## 24 I54       FALSE    1 0.05405405 0.2761194 0.0000000000
## 25 I80       FALSE    1 0.02702703 0.2569444 0.0000000000
## 26 I20        TRUE    3 0.13513514 0.4157303 0.2309809810
## 27 I64       FALSE    2 0.08108108 0.3057851 0.0385385385
## 28 I46       FALSE    2 0.05405405 0.2700730 0.0000000000
## 29 I19       FALSE    2 0.02702703 0.2661871 0.0000000000
## 30 I55       FALSE    1 0.05405405 0.2700730 0.0000000000
## 31 I14       FALSE    1 0.08108108 0.2781955 0.0000000000
## 32 I30       FALSE    2 0.05405405 0.2700730 0.0000000000
## 33 I29       FALSE    2 0.08108108 0.2700730 0.0000000000
## 34 I35       FALSE    2 0.02702703 0.2700730 0.0000000000
## 35 I27       FALSE    2 0.02702703 0.2740741 0.0000000000
## 36 I60       FALSE    1 0.02702703 0.2661871 0.0000000000
## 37 I22       FALSE    1 0.02702703 0.2740741 0.0000000000
## 38 I66       FALSE    2 0.02702703 0.2740741 0.0000000000

Chapter 3 - Imbalanced Class Distributions

Dealing with imbalanced datasets:

  • Key challenge is labelling events as fraud or not (classification or anomaly detection)
  • A meaningful risk is that a model seeing one very large class simply classifies everything into that class
  • Preference for a balanced distribution; tends to drive improved model performance
    • Increase (over-sample) the number of fraud (minority) cases
    • Decrease (under-sample) the number of normal (majority) cases
  • Example of random over-sampling of the dataset
    • Increase the minority (fraud) cases in the training sample through randomly over-sampling the fraud data
    • ROSE package: Random Over-Sampling Examples
    • ovun.sample() for random over-sampling, under-sampling or combination!
    • n_legit <- 24108
    • new_frac_legit <- 0.50
    • new_n_total <- n_legit/new_frac_legit # = 24108/0.50 = 48216
    • library(ROSE)
    • oversampling_result <- ovun.sample(Class ~ ., data = creditcard, method = "over", N = new_n_total, seed = 2018)
    • oversampled_credit <- oversampling_result$data
    • table(oversampled_credit$Class)

Random under-sampling:

  • Can instead change the class distribution through random under-sampling
    • Remove some of the non-fraud cases rather than double-counting some of the fraud cases
  • Can again use ovun.sample for this task
    • new_n_total <- n_fraud/new_frac_fraud # = 492/0.50 = 984
    • undersampling_result <- ovun.sample(Class ~ ., data = creditcard, method = "under", N = new_n_total, seed = 2018)
  • Can also combine both over-sampling and under-sampling
    • n_new <- nrow(creditcard) # = 24600
    • fraction_fraud_new <- 0.50
    • sampling_result <- ovun.sample(Class ~ ., data = creditcard, method = "both", N = n_new, p = fraction_fraud_new, seed = 2018)
    • sampled_credit <- sampling_result$data

Synthetic minority over-sampling:

  • SMOTE - Synthetic Minority Oversampling Technique
    • Find the nearest k fraudulent neighbors of a minority class data point
    • Randomly choose one of the k neighbors
    • Choose a random number between 0 and 1
    • Linear combination of the minority class data point and the randomly chosen neighbor, using the random number above as the weighting
    • Repeat the process “dup_size” number of times
  • Can run the process using the “smotefamily” library
    • library(smotefamily)
    • smote_output = SMOTE(X = transfer_data[, -1], target = transfer_data$isFraud, K = 4, dup_size = 10)
    • oversampled_data = smote_output$data
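The interpolation step at the heart of SMOTE can be sketched by hand, independent of the smotefamily package; the two data points below are made up for illustration:

```r
set.seed(1)

# A minority-class case and one of its K nearest minority neighbors (made-up values)
x        <- c(amount = 100, balance = 500)
neighbor <- c(amount = 140, balance = 620)

# Random weight between 0 and 1
r <- runif(1)

# Synthetic case: a linear combination of the point and its neighbor,
# i.e. a point on the line segment between them
synthetic <- x + r * (neighbor - x)
synthetic
```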

From dataset to detection model:

  • General SMOTE modeling process
    • Divide data in to train and test
    • Run SMOTE on the training data only
    • Train the model on the SMOTE-balanced dataset
    • Test performance on the test dataset
  • Example of running using SMOTE
    • smote_result = SMOTE(X = train[, -17], target = train$Class, K = 5, dup_size = 10)
    • train_oversampled = smote_result$data
    • colnames(train_oversampled)[17] = "Class"
    • model2 = rpart::rpart(Class ~ ., data = train_oversampled)
    • scores2 = predict(model2, newdata = test, type = "prob")[, 2]
    • predicted_class2 = factor(ifelse(scores2 > 0.5, 1, 0))
    • CM2 = caret::confusionMatrix(data = predicted_class2, reference = test$Class)
    • auc(roc(response = test$Class, predictor = scores2))
  • Modeling can be based on costs - associated with fraud, and associated with wrongly flagging fraud
    • cost_model <- function(predicted.classes, true.classes, amounts, fixedcost) {
    • cost <- sum(true.classes * (1 - predicted.classes) * amounts + predicted.classes * fixedcost)
    • return(cost)
    • }
    • cost_model(predicted_class1, test$Class, test$Amount, fixedcost = 10)
    • cost_model(predicted_class2, test$Class, test$Amount, fixedcost = 10)

Example code includes:

load("./RInputFiles/transfers02_v2.RData")  # data.frame transfers is 628x12

# Make a scatter plot
ggplot(transfers, aes(x = amount, y = orig_balance_before)) +
    geom_point(aes(color = fraud_flag, shape = fraud_flag)) +
    scale_color_manual(values = c('dodgerblue', 'red'))

load("./RInputFiles/creditcard5.RData")  # data.frame creditcard is 9840x32

# Calculate the required number of cases in the over-sampled dataset
n_new <- sum(creditcard$Class==0) / (1-0.3333)

# Over-sample
oversampling_result <- ROSE::ovun.sample(formula = Class ~ ., data = creditcard, 
                                         method = "over", N = n_new, seed = 2018
                                         )

# Verify the Class-balance of the over-sampled dataset
oversampled_credit <- oversampling_result$data
prop.table(table(oversampled_credit$Class))
## 
##         0         1 
## 0.6667142 0.3332858
# Calculate the required number of cases in the over-sampled dataset
n_new <- sum(creditcard$Class == 1) / (0.4)

# Under-sample
undersampling_result <- ROSE::ovun.sample(formula = Class ~ ., data = creditcard,
                                          method = "under", N = n_new, seed = 2018
                                          )

# Verify the Class-balance of the under-sampled dataset
undersampled_credit <- undersampling_result$data
prop.table(table(undersampled_credit$Class))
## 
##   0   1 
## 0.6 0.4
# Specify the desired number of cases in the balanced dataset and the fraction of fraud cases
n_new <- 10000
fraud_fraction <- 0.3

# Combine ROS & RUS!
sampling_result <- ROSE::ovun.sample(formula = Class ~ ., data = creditcard, method = "both", 
                                     N = n_new,  p = fraud_fraction, seed = 2018
                                     )

# Verify the Class-balance of the re-balanced dataset
sampled_credit <- sampling_result$data
prop.table(table(sampled_credit$Class))
## 
##      0      1 
## 0.6984 0.3016
# Set the number of fraud and legitimate cases, and the desired percentage of legitimate cases
n0 <- sum(creditcard$Class==0)
n1 <- sum(creditcard$Class==1)
r0 <- 0.6

# Calculate the value for the dup_size parameter of SMOTE
ntimes <- ((1 - r0) / r0) * (n0 / n1) - 1

# Create synthetic fraud cases with SMOTE
smote_output <- smotefamily::SMOTE(X = creditcard[ , -c(1, 31, 32)], target = creditcard$Class, 
                                   K = 5, dup_size = ntimes
                                   )

# Make a scatter plot of the original and over-sampled dataset
credit_smote <- smote_output$data
colnames(credit_smote)[30] <- "Class"
prop.table(table(credit_smote$Class))
## 
##         0         1 
## 0.6129032 0.3870968
ggplot(creditcard, aes(x = V1, y = V2, color = Class)) +
    geom_point() +
    scale_color_manual(values = c('dodgerblue2', 'red'))

ggplot(credit_smote, aes(x = V1, y = V2, color = Class)) +
    geom_point() +
    scale_color_manual(values = c('dodgerblue2', 'red'))

set.seed(1903172344)
testIdx <- sort(sample(1:nrow(creditcard), round(0.5*nrow(creditcard)), replace=FALSE))
test <- creditcard[testIdx, ]
train_original <- creditcard[-testIdx, ]

n_new <- 7380
fraud_fraction <- 0.3
train_oversampled <- ROSE::ovun.sample(formula = Class ~ ., data = train_original, 
                                       method = "over", N = n_new, seed = 2018
                                       )$data

# Train the rpart algorithm on the original training set and the SMOTE-rebalanced training set
model_orig <- rpart::rpart(Class ~ ., data = train_original)
model_smote <- rpart::rpart(Class ~ ., data = train_oversampled)

# Predict the fraud probabilities of the test cases
scores_orig <- predict(model_orig, newdata = test, type = "prob")[, 2]
scores_smote <- predict(model_smote, newdata = test, type = "prob")[, 2]

# Convert the probabilities to classes (0 or 1) using a cutoff value
predicted_class_orig <- factor(ifelse(scores_orig > 0.5, 1, 0))
predicted_class_smote <- factor(ifelse(scores_smote > 0.5, 1, 0))

# Determine the confusion matrices and the model's accuracy
CM_orig <- caret::confusionMatrix(data = predicted_class_orig, reference = test$Class)
CM_smote <- caret::confusionMatrix(data = predicted_class_smote, reference = test$Class)
print(CM_orig$table)
##           Reference
## Prediction    0    1
##          0 4666   47
##          1    3  204
print(CM_orig$overall[1])
##  Accuracy 
## 0.9898374
print(CM_smote$table)
##           Reference
## Prediction    0    1
##          0 4577   38
##          1   92  213
print(CM_smote$overall[1])
##  Accuracy 
## 0.9735772
cost_model <- function(predicted.classes, true.classes, amounts, fixedcost) {
    predicted.classes <- hmeasure::relabel(predicted.classes)
    true.classes <- hmeasure::relabel(true.classes)
    cost <- sum(true.classes * (1 - predicted.classes) * amounts + predicted.classes * fixedcost)
    return(cost)
}

# Calculate the total cost of deploying the original model
cost_model(predicted_class_orig, test$Class, test$Amount, fixedcost=10)
## [1] 9923.82
# Calculate the total cost of deploying the model using SMOTE
cost_model(predicted_class_smote, test$Class, test$Amount, fixedcost=10)
## [1] 11391.11

Chapter 4 - Digit Analysis and Robust Statistics

Digit analysis using Benford’s Law:

  • Benford’s Law considers the left-most digit in a number
    • Roughly 30.1% of numbers begin with a 1, while only 4.6% begin with a 9
    • Generally, frequencies will be proportional to log10(1 + 1/d) where d is the digit
    • Pinkham discovered that Benford’s Law is scale invariant (e.g., Euros to pesos should preserve the distribution)
  • Example of Benford’s Law in code
    • benlaw <- function(d) log10(1 + 1 / d)
    • df <- data.frame(digit = 1:9, probability = benlaw(1:9))
    • ggplot(df, aes(x = digit, y = probability)) + geom_bar(stat = "identity", fill = "dodgerblue") + xlab("First digit") + ylab("Expected frequency") + scale_x_continuous(breaks = 1:9, labels = 1:9) + ylim(0, 0.33) + theme(text = element_text(size = 25))
    • n <- 1000
    • fibnum <- numeric(n)
    • fibnum[1] <- 1
    • fibnum[2] <- 1
    • for (i in 3:n) { fibnum[i] <- fibnum[i-1]+fibnum[i-2] }
    • pow2 <- 2^(1:n)
    • library(benford.analysis)
    • bfd.fib <- benford(fibnum, number.of.digits = 1)
    • plot(bfd.fib)
    • bfd.pow2 <- benford(pow2, number.of.digits = 1)
    • plot(bfd.pow2)

Benford’s Law for fraud detection:

  • Many datasets follow Benford’s law
    • data where numbers represent sizes of facts or events
    • data in which numbers have no relationship to each other
    • data sets that grow exponentially or arise from multiplicative fluctuations
    • mixtures of different data sets
    • Some well-known infinite integer sequences
  • Fraud is typically committed by changing numbers, and changes often fail to conform with Benford’s Law
  • Caution that many types of data can never be expected to comply with Benford’s Law
    • If there is a lower and/or upper bound, or the data are concentrated in a narrow interval, e.g. hourly wage rates, heights of people
    • If numbers are used as identification numbers or labels, e.g. social security number, flight numbers, car license plate numbers, phone numbers
    • Additive fluctuations instead of multiplicative fluctuations, e.g. heartbeats on a given day
  • Benford’s Law can be applied for the first two digits of a series of numbers
    • P(d1d2) = log10(1 + 1/(d1d2)) # note that this is NOT multiplicative; d1d2 for d1=1 and d2=2 is 12, not 2
    • bfd.cen <- benford(census.2009$pop.2009, number.of.digits = 2)
    • plot(bfd.cen)
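As a quick sanity check on the two-digit formula above, a short sketch confirming that the probabilities form a proper distribution:

```r
# P(d1d2) = log10(1 + 1/d1d2) for the two-digit prefixes 10 through 99
benlaw2 <- function(d) log10(1 + 1 / d)
p <- benlaw2(10:99)

sum(p)       # the sum telescopes to log10(100/10) = 1, so this is a valid distribution
benlaw2(12)  # probability that a number starts with the digits "12"
```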

Detecting univariate outliers:

  • Can use global statistics to detect outliers; not all outliers are fraudulent, so follow-up and validation are needed
  • One popular tool for outlier detection is the z-score: (x - mu) / sigma
    • One challenge is that the sample mean and sample standard deviation are themselves distorted by an extreme outlier
  • “Robust statistics” attempt to flag outliers better than the classical z-scores
    • Classical statistical methods rely on (normality) assumptions, but even a single outlier can influence conclusions significantly and may lead to misleading results
    • Robust statistics also produce reliable results when the data contain outliers, and yield automatic outlier-detection tools
    • "It is perfectly proper to use both classical and robust methods routinely, and only worry when they differ enough to matter… But when they differ, you should think hard." (J.W. Tukey, 1979)
  • The median is more robust than the mean, while the median-absolute-deviation (MAD) and IQR are more robust than the standard deviation
  • The Boxplot is a common way to identify outliers
    • ggplot(data.frame(los), aes(x = "", y = los)) + geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3, fill = "lightblue", width = 0.5) + xlab("") + ylab("Length Of Stay (LOS)") + theme(text = element_text(size = 25))
  • The boxplot fences assume (approximate) normality, even though in reality data may not be normally distributed
    • For asymmetric distributions, the boxplot may flag many regular points as outliers
    • The skewness-adjusted boxplot corrects for this by using a robust measure of skewness in determining the fence
    • library(robustbase)
    • adjbox_stats <- adjboxStats(los)$stats
    • ggplot(data.frame(los), aes(x = "", y = los)) +
    • stat_boxplot(geom = "errorbar", width = 0.2, coef = 1.5 * exp(3 * mc(los))) +
    • geom_boxplot(ymin = adjbox_stats[1], ymax = adjbox_stats[5], middle = adjbox_stats[3], upper = adjbox_stats[4], lower = adjbox_stats[2], outlier.shape = NA, fill = "lightblue", width = 0.5) +
    • geom_point(data = subset(data.frame(los), los < adjbox_stats[1] | los > adjbox_stats[5]), col = "red", size = 3, shape = 16) + xlab("") +
    • ylab("Length Of Stay (LOS)") +
    • theme(text = element_text(size = 25))
    • adjbox(los, col = "lightblue", ylab = "LOS data")$out
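
The contrast between classical and robust z-scores is easy to demonstrate. The sketch below is a language-neutral Python illustration (the notes themselves use R) of the same median/MAD construction, including the 1.4826 consistency factor that R's mad() applies by default; the data vector is made up.

```python
import statistics

def classical_z(xs):
    """Classical z-scores: center by the mean, scale by the sample sd."""
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

def robust_z(xs):
    """Robust z-scores: center by the median, scale by the MAD.

    The MAD is multiplied by 1.4826 so it is consistent with the
    standard deviation at normal data (R's mad() default)."""
    m = statistics.median(xs)
    mad = 1.4826 * statistics.median(abs(x - m) for x in xs)
    return [(x - m) / mad for x in xs]

# Nine regular values plus one extreme outlier
data = [10, 11, 9, 10, 12, 11, 10, 9, 10, 500]

# The outlier inflates the classical mean and sd, masking itself...
print(max(abs(z) for z in classical_z(data)))  # stays below 3
# ...while the robust version flags it clearly
print(max(abs(z) for z in robust_z(data)))     # far above 3
```

This is exactly the masking effect the notes describe: with the |z| > 3 rule, the classical score misses the outlier that the robust score flags.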

Detecting multivariate outliers:

  • Multivariate outliers have unusual combinations of data across multiple dimensions
  • Example of animals data - brain vs. body
    • X <- data.frame(body = log(Animals$body), brain = log(Animals$brain))
    • fig <- ggplot(X, aes(x = body, y = brain)) + geom_point(size = 5) + xlab("log(body)") + ylab("log(brain)") + ylim(-5, 15) + scale_x_continuous(limits = c(-10, 16), breaks = seq(-15, 15, 5))
  • The Mahalanobis distance is a multidimensional distance (the distance along each axis is scaled based on the length of the ellipse in that direction)
    • Classical Mahalanobis distances use the sample mean as the estimate for location and the sample covariance matrix as the estimate for scatter
    • To detect multivariate outliers, the Mahalanobis distance is compared with a cut-off value derived from the chi-square distribution
    • In two dimensions, the corresponding 97.5% tolerance ellipsoid is defined by those observations whose Mahalanobis distance does not exceed the cut-off value
  • Extending the Mahalanobis distance to the animal data
    • animals.clcenter <- colMeans(X)
    • animals.clcov <- cov(X)
    • rad <- sqrt(qchisq(0.975, df = ncol(X)))
    • library(car)
    • ellipse.cl <- data.frame(ellipse(center = animals.clcenter, shape = animals.clcov, radius = rad, segments = 100, draw = FALSE))
    • colnames(ellipse.cl) <- colnames(X)
    • fig <- fig + geom_polygon(data = ellipse.cl, color = "dodgerblue", fill = "dodgerblue", alpha = 0.2) + geom_point(aes(x = animals.clcenter[1], y = animals.clcenter[2]), color = "blue", size = 6)
    • fig
  • Can improve the Mahalanobis distance with a more robust estimate of location and scatter
    • Minimum Covariance Determinant (MCD) estimator of Rousseeuw is a popular robust estimator of multivariate location and scatter
    • MCD looks for the h observations whose classical covariance matrix has the lowest possible determinant
    • The MCD estimate of location is then the mean of these h observations
    • The MCD estimate of scatter is then the sample covariance matrix of these h points (multiplied by a consistency factor)
    • A reweighting step is applied to improve efficiency for normal data
    • Computing the MCD exactly is difficult, but several fast approximate algorithms have been proposed
  • Examples of using robust statistics on the animals data
    • library(robustbase)
    • animals.mcd <- covMcd(X)
    • animals.mcd$center
    • animals.mcd$cov
    • ellipse.mcd <- data.frame(ellipse(center = animals.mcd$center, shape = animals.mcd$cov, radius = rad, segments = 100, draw = FALSE))
    • colnames(ellipse.mcd) <- colnames(X)
    • fig <- fig + geom_polygon(data = ellipse.mcd, color = "red", fill = "red", alpha = 0.3) + geom_point(aes(x = animals.mcd$center[1], y = animals.mcd$center[2]), color = "red", size = 6)
    • fig
  • The distance-distance plot is a common alternative when the number of dimensions is 4+
    • When p > 3 it is not possible to visualize the tolerance ellipsoid
    • The distance-distance plot shows the robust distance of each observation versus its classical Mahalanobis distance, and is obtained directly from the MCD object
    • plot(animals.mcd, which = "dd")
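
The classical Mahalanobis machinery above can be sketched in a few lines for the two-dimensional case. This is a hand-rolled Python illustration (not the course's R code) on made-up data: twenty points scattered around the line y = x, plus one point that sits far from the joint linear pattern.

```python
import math

def mahalanobis_sq_2d(points):
    """Classical squared Mahalanobis distances for 2-D data,
    using the sample mean and sample covariance (denominator n - 1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # 2x2 inverse
    return [(x - mx) ** 2 * ixx
            + 2 * (x - mx) * (y - my) * ixy
            + (y - my) ** 2 * iyy
            for x, y in points]

# 97.5% cut-off: for df = 2 the chi-square quantile has the closed
# form -2 * ln(1 - p), matching qchisq(0.975, df = 2) in R
cutoff = math.sqrt(-2 * math.log(0.025))

# Twenty points around the line y = x, plus one point far off that line
pts = [(i, i + 0.5 * (-1) ** i) for i in range(1, 21)] + [(10.0, 25.0)]
flagged = [math.sqrt(d2) > cutoff for d2 in mahalanobis_sq_2d(pts)]
print(flagged.count(True))  # 1 -- only the last point exceeds the cut-off
```

The flagged point would fall outside the 97.5% tolerance ellipsoid even though neither coordinate alone is wildly extreme, which is the defining feature of a multivariate outlier.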

Example code includes:

# Implement Benford's Law for first digit
benlaw <- function(d) log10(1 + 1 / d)

# Calculate expected frequency for d=5
benlaw(d=5)
## [1] 0.07918125
# Create a dataframe of the 9 digits and their Benford's Law probabilities
df <- data.frame(digit = 1:9, probability = benlaw(1:9))

# Create barplot with expected frequencies
ggplot(df, aes(x = digit, y = probability)) + 
    geom_bar(stat = "identity", fill = "dodgerblue") + 
    xlab("First digit") + 
    ylab("Expected frequency") + 
    scale_x_continuous(breaks = 1:9, labels = 1:9) + 
    ylim(0, 0.33) + 
    theme(text = element_text(size = 25))

data(census.2009, package="benford.analysis")

# Check conformity
bfd.cen <- benford.analysis::benford(census.2009$pop.2009, number.of.digits = 1) 
plot(bfd.cen, except = c("second order", "summation", "mantissa", "chi squared", 
                         "abs diff", "ex summation", "Legend"), 
     multiple = F
     ) 

# Multiply the data by 3 and check conformity again
data <- census.2009$pop.2009 * 3
bfd.cen3 <- benford.analysis::benford(data, number.of.digits=1)
plot(bfd.cen3, except = c("second order", "summation", "mantissa", "chi squared", 
                          "abs diff", "ex summation", "Legend"), 
     multiple = F
     )

load("./RInputFiles/fireinsuranceclaims.RData")  # num[1:40000] fireinsuranceclaims

# Validate data against Benford's Law using first digit
bfd.ins <- benford.analysis::benford(fireinsuranceclaims, number.of.digits = 1) 
plot(bfd.ins, except=c("second order", "summation", "mantissa", "chi squared",
                       "abs diff", "ex summation", "Legend"), 
     multiple = F
     )

# Validate data against Benford's Law using first-two digits
bfd.ins2 <- benford.analysis::benford(fireinsuranceclaims, number.of.digits = 2)
plot(bfd.ins2, except=c("second order", "summation", "mantissa", "chi squared",
                        "abs diff", "ex summation", "Legend"), 
     multiple = F
     )

load("./RInputFiles/expensesCEO.RData")  # num[1:988] expensesCEO

# Validate data against Benford's Law using first digit
bfd.exp <- benford.analysis::benford(expensesCEO, number.of.digits = 1) 
plot(bfd.exp, except=c("second order", "summation", "mantissa", "chi squared",
                       "abs diff", "ex summation", "Legend"), 
     multiple = F
     )

# Validate data against Benford's Law using first-two digits
bfd.exp2 <- benford.analysis::benford(expensesCEO, number.of.digits = 2) 
plot(bfd.exp2, except=c("second order", "summation", "mantissa", "chi squared",
                        "abs diff", "ex summation", "Legend"), 
     multiple = F
     )

load("./RInputFiles/transfers_chap1_L4.RData")  # data.frame transfers 222x16

# Get observations identified as fraud
which(transfers$fraud_flag == 1)
## [1]  71 198 220 222
# Compute median and mean absolute deviation for `amount`
m <- median(transfers$amount)
s <- mad(transfers$amount)

# Compute robust z-score for each observation
robzscore <- abs((transfers$amount - m) / (s))

# Get observations with robust z-score higher than 3 in absolute value
which(abs(robzscore) > 3)
##  [1]   1   7  14  18  31  36  43  44  48  49  50  53  63  71  79  91 112
## [18] 113 116 122 123 124 132 152 153 162 168 170 177 191 198 205 214 219
## [35] 220 222
thexp <- c(40517, 33541, 5182, 40385, 40302, 23189, 13503, 5110, 15754, 40763, 23061, 30839, 25206, 15891, 38821, 11766, 4934, 13754, 14142, 27813, 21005, 11511, 41750, 32855, 62043, 19415, 51815, 26961, 19185, 19704, 21831, 40768, 49079, 12766, 13030, 8841, 17943, 6214, 21114, 7898, 30707, 69698, 70155, 15032, 55858, 31747, 11562, 12390, 7016, 96396, 24614, 22735, 20483, 36907, 31822, 13619, 34401, 10281, 32165, 52226, 13941, 40850, 15270, 21143, 26029, 10209, 10950, 19745, 54153, 33668, 7562, 34231, 34219, 25784, 52952, 32959, 17459, 25611, 14998, 36229, 26485, 20563, 41865, 29821, 26792, 42406, 20083, 10205, 31353, 33674, 13523, 51835, 18136, 54736, 33499, 95389, 44967, 67707, 40879, 17729, 15643, 15648, 19150, 9789, 27978, 40469, 30696, 48195, 12817, 10527, 42946, 72281, 13773, 17189, 14340, 47962, 29063, 34477, 84354, 37943, 13584, 12184, 49563, 36263, 18313, 25399, 50235, 14230, 25617, 18226, 31542, 24262, 17617, 22068, 43534, 14574, 6471, 57500, 8535, 85065, 22749, 10481, 42094, 24436, 27975, 28347, 32929, 20106, 30992, 22202, 17005, 29900, 16871, 10790, 38355, 10315, 7782, 16084, 11788, 20005, 70859, 21706, 11929, 69816, 6351, 27217, 30178, 10597, 13715, 13687, 17116, 27426, 56579, 31655, 86577, 27051, 10477, 11178, 49785, 12626, 44817, 15758, 21396, 5590, 40538, 38834, 32693, 47330, 17823, 92957, 44439, 27188, 22972, 40020, 33067, 24562, 8408, 36088, 8823, 8022, 36395, 14523, 49188, 19744, 17536, 32456, 38400, 6451, 31766, 24727, 60013, 15664, 17356, 50482, 20752, 28048, 10932, 35337, 18755, 6572, 16065, 67257, 36303, 14846, 50468, 27237, 7165, 38067, 22040, 23794, 106032, 19303, 63934, 16818, 11621, 40566, 14921, 15188, 28087, 21026, 38907, 39727, 49794, 49112, 6886, 31674, 25053, 23835, 12160, 45640, 7898, 107065, 58206, 8270, 69529, 25776, 57742, 18456, 53523, 27514, 14089, 7291, 4727, 7319, 22650, 8462, 14980, 39085, 13627, 14998, 29686, 22750, 25487, 40127, 22844, 7597, 19265, 14869, 33010, 74958, 30320, 16602, 36376, 16467, 26946, 26870, 33433, 
61134, 20121, 58389, 61594, 24150, 32594, 20177, 3133, 22330, 16905, 25053, 10837, 20807, 6647, 29696, 34457, 78286, 21551, 26533, 52237, 58745, 19607, 25062, 23823, 108756, 12067, 28813, 17803, 35795, 20860, 14307, 11991, 71409, 74111, 25916, 25914, 14092, 24780, 43417, 12767, 43919, 28205, 34075, 68173, 28509, 78760, 52329, 31858, 54088, 6670, 29371, 30066, 16554, 13866, 13806, 40504, 49841, 19729, 30881, 35484, 11373, 10624, 7544, 49465, 12499, 18316, 27963, 31601, 52243, 48927, 42339, 34707, 13034, 9452, 69461, 64335, 9964, 8993, 11217, 116262, 16693, 28622, 35251, 8587, 28414, 15552, 49460, 57721, 16398, 20597, 28592, 13795, 19534, 37065, 22847, 10679, 37552, 63611, 12023, 19659, 16040, 63519, 56897, 55381, 44055, 65130, 8251, 4857, 6225, 74197, 20986, 47412, 35014, 39263, 19486, 22648, 7826, 52697, 9196, 31890, 8836, 62723, 60814, 42018, 20671, 7444, 21584, 13213, 23959, 16394, 21742, 71587, 16866, 63788, 39375, 37087, 9370, 37440, 10666, 60304, 34391, 55988, 9921, 11709, 69198, 49491, 39201, 10065, 32063, 10380, 18818, 35576, 15695, 30336, 74993, 48089, 24255, 56063, 11932, 23251, 9537, 7757, 67473, 44949, 11842, 12681, 36156, 71957, 37576, 36120, 53232, 6743, 36142, 62565, 49525, 33161, 28272, 24179, 15227, 46151, 32764, 32374, 22462, 9593, 9442, 13202, 24634, 35284, 44800, 11044, 45655, 5502, 18124, 18007, 39303, 14012, 26558, 10926, 25389, 16447, 8720, 12977, 12942, 25020, 6430, 18919, 24916, 55667, 51081, 99834, 31840, 147498)
thexp <- c(thexp, 27913, 40044, 24278, 41614, 50038, 54307, 27598, 18022, 34789, 24503, 51856, 12160, 19107, 40845, 46171, 66460, 22241, 15245, 10925, 17671, 5666, 25305, 9394, 20357, 32948, 7942, 14555, 24012, 25235, 17292, 81646, 46737, 15012, 49861, 10012, 15076, 7693, 18513, 46293, 50770, 36122, 25598, 8633, 15568, 16954, 47036, 38076, 18458, 8092, 38576, 28692, 27211, 37485, 15162, 32968, 55021, 7060, 16714, 34581, 14939, 27056, 15090, 56905, 29528, 21282, 39487, 24239, 35466, 21982, 10334, 15133, 41591, 23260, 12882, 32149, 16219, 41605, 8346, 8549, 23818, 36217, 42766, 11239, 59532, 31806, 52218, 73118, 31701, 32761, 18745, 17949, 8017, 12833, 25583, 36468, 8706, 94587, 42900, 74298, 17201, 25618, 14888, 16308, 75043, 68056, 50797, 15956, 13820, 13985, 22742, 17692, 30214, 57582, 17273, 31885, 14307, 22597, 46389, 23366, 35128, 51769, 9251, 35663, 34474, 18748, 60091, 31137, 14366, 25347, 32175, 16065, 16672, 45192, 42039, 19665, 24933, 29570, 23400, 13517, 23993, 18140, 9545, 16042, 24425, 28400, 25035, 28316, 19001, 27203, 8016, 15199, 14069, 12037, 30455, 13877, 10696, 11010, 41384, 37241, 38328, 54434, 27174, 14015, 27354, 12944, 19718, 21558, 22239, 31076, 32940, 17810, 13462, 16122, 20417, 36205, 9871, 11892, 50737, 32511, 29767, 17032, 40276, 24005, 33884, 16278, 11326, 7187, 50434, 70436, 38353, 27723, 32811, 14833, 28465, 83158, 18866, 21823, 39125, 45372, 33933, 29469, 79147, 24190, 37007, 4259, 41346, 25087, 27216, 32780, 21190, 29067, 11316, 36103, 15389, 27257, 26051, 22853, 10551, 6661, 15878, 17131, 18220, 12045, 10573, 34645, 19517, 13933, 14452, 33694, 35605, 48376, 19567, 35762, 12931, 6286, 25321, 22167, 24243, 19433, 26852, 25802, 26647, 26423, 11537, 16011, 56547, 22690, 20391, 10487, 16994, 25690, 19260, 57525, 17802, 28135, 14365, 21640, 20817, 55771, 31693, 31859, 27496, 39715, 22775, 27933, 58875, 39133, 7604, 29409, 29296, 44377, 33107, 21235, 59129, 33427, 10164, 14595, 4744, 49674, 27827, 48830, 36196, 24979, 47800, 108752, 52300, 
38343, 19381, 35881, 43688, 32938, 13341, 29297, 38603, 30202, 14797, 38529, 29055, 61303, 27109, 53496, 16665, 65132, 23903, 36096, 21247, 42292, 11176, 7542, 15210, 5289, 58444, 33295, 30456, 60595, 59624, 19642, 13317, 9262, 17611, 35079, 45469, 59510, 26852, 51484, 20195, 27751, 33555, 27692, 70407, 18102, 130773, 16637, 60463, 11653, 19275, 47114, 6117, 29645, 57846, 51033, 11790, 24970, 32391, 19278, 27778, 19596, 17761, 56884, 66230, 40617, 80495, 27704, 22815, 23390, 18092, 13037, 27954, 6979, 6942, 46155, 34240, 24484, 22375, 45916, 32788, 28017, 31922, 25357, 7314)

# Create boxplot
bp.thexp <- boxplot(thexp, col = "lightblue", main = "Standard boxplot", 
                    ylab = "Total household expenditure"
                    )

# Extract the outliers from the data
bp.thexp$out
##  [1]  96396  95389  84354  85065  86577  92957 106032 107065  74958  78286
## [11] 108756  74111  78760 116262  74197  74993  99834 147498  81646  94587
## [21]  74298  75043  83158  79147 108752 130773  80495
# Create adjusted boxplot
adj.thexp <- robustbase::adjbox(thexp, col = "lightblue", main = "Adjusted boxplot", 
                                ylab = "Total household expenditure"
                                )

load("./RInputFiles/hailinsurance.RData")  # matrix num[1:100, 1:2] hailinsurance

# Create a scatterplot
plot(hailinsurance, xlab = "price house", ylab = "claim")

# Compute the sample mean and sample covariance matrix
clcenter <- colMeans(hailinsurance)
clcov <- cov(hailinsurance)

# Add 97.5% tolerance ellipsoid
rad <- sqrt(qchisq(0.975, df=ncol(hailinsurance)))
car::ellipse(center = clcenter, shape = clcov, radius = rad,col = "blue", lty = 2)

# Create a scatterplot of the data
plot(hailinsurance, xlab = "price house", ylab = "claim")

# Compute robust estimates for location and scatter
mcdresult <- robustbase::covMcd(hailinsurance)
robustcenter <- mcdresult$center
robustcov <- mcdresult$cov

# Add robust 97.5% tolerance ellipsoid
rad <- sqrt(qchisq(0.975, df=ncol(hailinsurance)))
car::ellipse(center = robustcenter, shape = robustcov, radius = rad, col = "red")


Dimensionality Reduction in R

Chapter 1 - Principal Component Analysis

The curse of dimensionality:

  • The “curse of dimensionality” can make it challenging to understand the structure of a data set
    • As the dimensionality of the data grows, the feature space grows rapidly and the data become sparse
    • High computational cost to handle high-dimensional data
    • Estimation accuracy decreases
    • Interpretation of the data becomes difficult
  • Example of using the mtcars dataset (11 features)
    • Most of the dimensions could probably be reduced to a small set of latent dimensions, such as:
    • the size of the car
    • the country of origin
    • the construction year
  • Need to convert any factor variables (e.g., cylinders) and then run correlations
    • mtcars$cyl <- as.numeric(as.character(mtcars$cyl))
    • mtcars_correl <- cor(mtcars, use = "complete.obs")
    • library(ggcorrplot)
    • ggcorrplot(mtcars_correl)
  • Several options for dealing with the curse of dimensionality
    • Feature Engineering: Requires domain knowledge
    • Remove redundancy
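
A quick numeric illustration of that sparsity (not from the course): the "core" of a unit hypercube, the points more than 5% away from every face, occupies 0.9 of the volume per dimension, so its share of the space collapses as the dimensionality grows.

```python
# Fraction of a unit hypercube's volume in its core region,
# 0.9^d, for increasing dimension d -- the rest of the volume
# piles up near the faces, leaving high-dimensional data sparse
for d in (1, 2, 10, 50, 100):
    print(d, 0.9 ** d)
```

By d = 100 less than 0.01% of the volume remains in the core, which is one concrete face of the "feature space grows rapidly" point above.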

Getting PCA to work with FactoMineR:

  • PCA removes noise introduced by correlation, changes the coordinate dimensions to best explain variability in the data, etc.
    • Center and standardize
    • Rotation and projection
    • Projection and reduction
  • The screeplot can be helpful for finding elbow points for percentage of explained variance
  • Can implement using prcomp() or FactoMineR::PCA() in R
    • mtcars_pca <- prcomp(mtcars, scale. = TRUE) # scale for comparability; FactoMineR::PCA() standardizes by default
    • mtcars_pca <- FactoMineR::PCA(mtcars)
  • The eigenvalues and factor map can help with interpreting the outputs
    • mtcars_pca$eig
    • mtcars_pca$var$cos2 # squared cosine (closer to 1 is better)
    • mtcars_pca$var$contrib # contribution of each variable to the components
    • dimdesc(mtcars_pca)
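
For two standardized variables the PCA eigenvalues have a closed form, which makes the screeplot/explained-variance logic above easy to see. This Python sketch is for illustration only (the notes use R), and the engine-size/horsepower numbers are made up; it uses the fact that the 2x2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (sx * sy)

def pca_eigenvalues_2d(xs, ys):
    """Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]],
    largest first, plus each component's share of explained variance
    (eigenvalue divided by the trace -- what a screeplot displays)."""
    r = pearson(xs, ys)
    eig = [1 + abs(r), 1 - abs(r)]
    share = [e / sum(eig) for e in eig]
    return eig, share

# Made-up engine sizes and horsepower values: strongly correlated,
# so one latent "size/power" dimension captures almost everything
size = [1.6, 2.0, 2.4, 3.0, 3.5, 4.6, 5.7]
hp = [110, 140, 160, 210, 225, 275, 345]
eig, share = pca_eigenvalues_2d(size, hp)
print(eig, share)  # PC1 explains almost all of the variance
```

This mirrors the "Variance / % of var." rows in the FactoMineR eigenvalue table: each dimension's percentage is its eigenvalue over the total.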

Interpreting and visualizing PCA models:

  • Can plot the contributions of the individual principal components (the fviz_* functions come from the factoextra package)
    • library(factoextra)
    • fviz_pca_var(mtcars_pca, col.var = "contrib", gradient.cols = c("#bb2e00", "#002bbb"), repel = TRUE)
    • fviz_pca_var(mtcars_pca, select.var = list(contrib = 4), repel = TRUE)
    • fviz_contrib(mtcars_pca, choice = "var", axes = 1, top = 5)
    • fviz_pca_ind(mtcars_pca, col.ind = "cos2", gradient.cols = c("#bb2e00", "#002bbb"), repel = TRUE)
    • fviz_pca_ind(mtcars_pca, select.ind = list(cos2 = 0.8), gradient.cols = c("#bb2e00", "#002bbb"), repel = TRUE)
    • fviz_cos2(mtcars_pca, choice = "ind", axes = 1, top = 10)
    • fviz_pca_biplot(mtcars_pca)
  • Can add ellipsoids for groups to the plot
    • mtcars$cyl <- as.factor(mtcars$cyl)
    • fviz_pca_ind(mtcars_pca, label = "var", habillage = mtcars$cyl, addEllipses = TRUE)

Example code includes:

cars <- as.data.frame(data.table::fread("./RInputFiles/04carsdata.csv"))

rowsDelete <- c(59, 65, 71, 83, 84, 85, 108, 109, 116, 119, 124, 127, 128, 138, 143, 146, 147, 148, 205, 239, 240, 244, 245, 247, 248, 256, 291, 293, 295, 296, 297, 304, 315, 321, 325, 355, 399, 400, 401, 414, 415)
cars$`Vehicle Name`[rowsDelete]
##  [1] "Cadillac Escalade EXT"               
##  [2] "Chevrolet Avalanche 1500"            
##  [3] "Chevrolet Colorado Z85"              
##  [4] "Chevrolet Silverado 1500 Regular Cab"
##  [5] "Chevrolet Silverado SS"              
##  [6] "Chevrolet SSR"                       
##  [7] "Dodge Dakota Club Cab"               
##  [8] "Dodge Dakota Regular Cab"            
##  [9] "Dodge Ram 1500 Regular Cab ST"       
## [10] "Dodge Viper SRT-10 convertible 2dr"  
## [11] "Ford Excursion 6.8 XLT"              
## [12] "Ford F-150 Regular Cab XL"           
## [13] "Ford F-150 Supercab Lariat"          
## [14] "Ford Ranger 2.3 XL Regular Cab"      
## [15] "GMC Canyon Z85 SL Regular Cab"       
## [16] "GMC Sierra Extended Cab 1500"        
## [17] "GMC Sierra HD 2500"                  
## [18] "GMC Sonoma Crew Cab"                 
## [19] "Kia Amanti 4dr"                      
## [20] "Mazda B2300 SX Regular Cab"          
## [21] "Mazda B4000 SE Cab Plus"             
## [22] "Mazda RX-8 4dr automatic"            
## [23] "Mazda RX-8 4dr manual"               
## [24] "Mazda3 i 4dr"                        
## [25] "Mazda3 s 4dr"                        
## [26] "Mercedes-Benz C320 4dr"              
## [27] "Mitsubishi Galant ES 2.4L 4dr"       
## [28] "Mitsubishi Lancer ES 4dr"            
## [29] "Mitsubishi Lancer LS 4dr"            
## [30] "Mitsubishi Lancer OZ Rally 4dr auto" 
## [31] "Mitsubishi Lancer Sportback LS"      
## [32] "Nissan Frontier King Cab XE V6"      
## [33] "Nissan Titan King Cab XE"            
## [34] "Pontiac Bonneville GXP 4dr"          
## [35] "Pontiac GTO 2dr"                     
## [36] "Subaru Baja"                         
## [37] "Toyota Tacoma"                       
## [38] "Toyota Tundra Access Cab V6 SR5"     
## [39] "Toyota Tundra Regular Cab V6"        
## [40] "Volkswagen Phaeton 4dr"              
## [41] "Volkswagen Phaeton W12 4dr"
rowsMod <- c(182, 183, 252, 253, 255, 256)
cars$`Vehicle Name`[rowsMod]
## [1] "Infiniti G35 4dr"       "Infiniti G35 4dr"      
## [3] "Mercedes-Benz C240 4dr" "Mercedes-Benz C240 4dr"
## [5] "Mercedes-Benz C320 4dr" "Mercedes-Benz C320 4dr"
rowNames <- cars$`Vehicle Name`
rowNames[182] <- paste0(rowNames[182], " RWD")
rowNames[183] <- paste0(rowNames[183], " AWD")
rowNames[252] <- paste0(rowNames[252], " RWD")
rowNames[253] <- paste0(rowNames[253], " AWD")
rowNames[255] <- paste0(rowNames[255], " RWD")
rowNames[256] <- paste0(rowNames[256], " AWD")

row.names(cars) <- rowNames
cars <- cars[-rowsDelete, ]

cars$`Vehicle Name` <- NULL
cars$type <- factor(c(3, 3, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 3, 3, 3, 4, 6, 3, 4, 4, 4, 3, 3, 3, 3, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 4, 4, 3, 3, 3, 3, 3, 5, 3, 3, 5, 3, 3, 3, 5, 3, 5, 4, 1, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 6, 3, 3, 5, 5, 5, 5, 1, 3, 3, 3, 3, 3, 4, 6, 3, 3, 3, 3, 3, 3, 1, 1, 5, 1, 5, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 3, 3, 3, 6, 3, 3, 1, 4, 4, 3, 6, 3, 4, 5, 1, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 3, 1, 1, 5, 4, 5, 3, 3, 3, 3, 3, 3, 5, 3, 3, 4, 3, 3, 6, 6, 3, 3, 3, 3, 3, 3, 5, 5, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 5, 5, 5, 3, 3, 3, 3, 6, 1, 5, 3, 3, 3, 5, 5, 5, 3, 3, 3, 5, 3, 3, 6, 3, 5, 5, 4, 5, 3, 3, 3, 3, 5, 3, 3, 3, 1, 4, 4, 5, 3, 3, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 3, 6, 3, 5, 5, 3, 3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 1, 5, 6, 3, 3, 3, 3, 3, 4, 4, 5, 3, 4, 5, 5, 4, 4, 3, 3, 3, 3, 6, 5, 5, 1, 1, 3, 3, 3, 5, 3, 3, 1, 5, 3, 3, 3, 1, 1, 3, 3, 6, 4, 4, 4, 4, 4, 4, 5, 3, 3, 3, 3, 6, 3, 3, 3, 6, 3, 3, 3, 3, 3, 5, 3, 6, 6, 3, 4, 4, 3, 3, 6, 3, 3, 3, 3, 3, 6, 3, 3, 3, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 5, 5, 6, 4, 3, 5, 5, 1, 1, 3, 3, 6, 3, 3, 3, 3, 6, 3, 3, 6, 3, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6, 5), levels=c(1, 2, 3, 4, 5, 6), labels=c('Minivan', 'Pickup', 'Small.Sporty..Compact.Large.Sedan', 'Sports.Car', 'SUV', 'Wagon'))
cars$wheeltype <- factor(c(1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1), levels=c(1, 2), labels=c("AWD", "RWD"))

colNames <- c('Small.Sporty..Compact.Large.Sedan', 'Sports.Car', 'SUV', 'Wagon', 'Minivan', 'Pickup', 'AWD', 'RWD', 'Retail.Price', 'Dealer.Cost', 'Engine.Size..l.', 'Cyl', 'HP', 'City.MPG', 'Hwy.MPG', 'Weight', 'Wheel.Base', 'Len', 'Width', 'type', 'wheeltype')
for (intCtr in seq_along(colNames)) {
    cat("Original Name: ", names(cars)[intCtr], " ---> New Name: ", colNames[intCtr], "\n")
}
## Original Name:  Small/Sporty/ Compact/Large Sedan  ---> New Name:  Small.Sporty..Compact.Large.Sedan 
## Original Name:  Sports Car  ---> New Name:  Sports.Car 
## Original Name:  SUV  ---> New Name:  SUV 
## Original Name:  Wagon  ---> New Name:  Wagon 
## Original Name:  Minivan  ---> New Name:  Minivan 
## Original Name:  Pickup  ---> New Name:  Pickup 
## Original Name:  AWD  ---> New Name:  AWD 
## Original Name:  RWD  ---> New Name:  RWD 
## Original Name:  Retail Price  ---> New Name:  Retail.Price 
## Original Name:  Dealer Cost  ---> New Name:  Dealer.Cost 
## Original Name:  Engine Size (l)  ---> New Name:  Engine.Size..l. 
## Original Name:  Cyl  ---> New Name:  Cyl 
## Original Name:  HP  ---> New Name:  HP 
## Original Name:  City MPG  ---> New Name:  City.MPG 
## Original Name:  Hwy MPG  ---> New Name:  Hwy.MPG 
## Original Name:  Weight  ---> New Name:  Weight 
## Original Name:  Wheel Base  ---> New Name:  Wheel.Base 
## Original Name:  Len  ---> New Name:  Len 
## Original Name:  Width  ---> New Name:  Width 
## Original Name:  type  ---> New Name:  type 
## Original Name:  wheeltype  ---> New Name:  wheeltype
names(cars) <- colNames

cars <- cars %>% mutate(City.MPG=as.integer(City.MPG), Hwy.MPG=as.integer(Hwy.MPG), 
                        Weight=as.integer(Weight), Wheel.Base=as.integer(Wheel.Base), 
                        Len=as.integer(Len), Width=as.integer(Width)
                        )
str(cars)
## 'data.frame':    387 obs. of  21 variables:
##  $ Small.Sporty..Compact.Large.Sedan: int  1 1 0 0 1 1 1 1 1 1 ...
##  $ Sports.Car                       : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ SUV                              : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Wagon                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Minivan                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pickup                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AWD                              : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ RWD                              : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ Retail.Price                     : int  43755 46100 36945 89765 23820 33195 26990 25940 31840 42490 ...
##  $ Dealer.Cost                      : int  39014 41100 33337 79978 21761 30299 24647 23508 28846 38325 ...
##  $ Engine.Size..l.                  : num  3.5 3.5 3.5 3.2 2 3.2 2.4 1.8 3 3 ...
##  $ Cyl                              : int  6 6 6 6 4 6 4 4 6 6 ...
##  $ HP                               : int  225 225 265 290 200 270 200 170 220 220 ...
##  $ City.MPG                         : int  18 18 17 17 24 20 22 22 20 20 ...
##  $ Hwy.MPG                          : int  24 24 23 24 31 28 29 31 28 27 ...
##  $ Weight                           : int  3880 3893 4451 3153 2778 3575 3230 3252 3462 3814 ...
##  $ Wheel.Base                       : int  115 115 106 100 101 108 105 104 104 105 ...
##  $ Len                              : int  197 197 189 174 172 186 183 179 179 180 ...
##  $ Width                            : int  72 72 77 71 68 72 69 70 70 70 ...
##  $ type                             : Factor w/ 6 levels "Minivan","Pickup",..: 3 3 5 4 3 3 3 3 3 3 ...
##  $ wheeltype                        : Factor w/ 2 levels "AWD","RWD": 1 2 1 2 2 1 2 2 2 1 ...
# Explore cars with summary()
summary(cars)
##  Small.Sporty..Compact.Large.Sedan   Sports.Car          SUV        
##  Min.   :0.0000                    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000                    1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000                    Median :0.0000   Median :0.0000  
##  Mean   :0.6047                    Mean   :0.1163   Mean   :0.1525  
##  3rd Qu.:1.0000                    3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000                    Max.   :1.0000   Max.   :1.0000  
##      Wagon            Minivan            Pickup       AWD        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0   Median :0.0000  
##  Mean   :0.07494   Mean   :0.05168   Mean   :0   Mean   :0.2016  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :0   Max.   :1.0000  
##       RWD          Retail.Price     Dealer.Cost     Engine.Size..l.
##  Min.   :0.0000   Min.   : 10280   Min.   :  9875   Min.   :1.400  
##  1st Qu.:0.0000   1st Qu.: 20997   1st Qu.: 19575   1st Qu.:2.300  
##  Median :0.0000   Median : 28495   Median : 26155   Median :3.000  
##  Mean   :0.2429   Mean   : 33231   Mean   : 30441   Mean   :3.127  
##  3rd Qu.:0.0000   3rd Qu.: 39553   3rd Qu.: 36124   3rd Qu.:3.800  
##  Max.   :1.0000   Max.   :192465   Max.   :173560   Max.   :6.000  
##       Cyl               HP           City.MPG        Hwy.MPG     
##  Min.   : 3.000   Min.   : 73.0   Min.   :10.00   Min.   :12.00  
##  1st Qu.: 4.000   1st Qu.:165.0   1st Qu.:18.00   1st Qu.:24.00  
##  Median : 6.000   Median :210.0   Median :19.00   Median :27.00  
##  Mean   : 5.757   Mean   :214.4   Mean   :20.31   Mean   :27.26  
##  3rd Qu.: 6.000   3rd Qu.:250.0   3rd Qu.:21.50   3rd Qu.:30.00  
##  Max.   :12.000   Max.   :493.0   Max.   :60.00   Max.   :66.00  
##      Weight       Wheel.Base         Len          Width      
##  Min.   :1850   Min.   : 89.0   Min.   :143   Min.   :64.00  
##  1st Qu.:3107   1st Qu.:103.0   1st Qu.:177   1st Qu.:69.00  
##  Median :3469   Median :107.0   Median :186   Median :71.00  
##  Mean   :3532   Mean   :107.2   Mean   :185   Mean   :71.28  
##  3rd Qu.:3922   3rd Qu.:112.0   3rd Qu.:193   3rd Qu.:73.00  
##  Max.   :6400   Max.   :130.0   Max.   :221   Max.   :81.00  
##                                 type     wheeltype
##  Minivan                          : 20   AWD:183  
##  Pickup                           :  0   RWD:204  
##  Small.Sporty..Compact.Large.Sedan:234            
##  Sports.Car                       : 45            
##  SUV                              : 59            
##  Wagon                            : 29
# Get the correlation matrix with cor()
correl <- cor(cars[,9:19], use = "complete.obs")

# Use ggcorrplot() to explore the correlation matrix
ggcorrplot::ggcorrplot(correl)

# Conduct hierarchical clustering on the correlation matrix
ggcorrplot_clustered <- ggcorrplot::ggcorrplot(correl, hc.order = TRUE, type = "lower")
ggcorrplot_clustered

# Run a PCA for the 10 non-binary numeric variables of cars
pca_output_ten_v <- FactoMineR::PCA(cars[,9:19], ncp = 4, graph = F)

# Get the summary of the first 100 cars
summary(pca_output_ten_v, nbelements = 100)
## 
## Call:
## FactoMineR::PCA(X = cars[, 9:19], ncp = 4, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               7.105   1.884   0.850   0.357   0.275   0.198
## % of var.             64.588  17.127   7.725   3.246   2.504   1.799
## Cumulative % of var.  64.588  81.714  89.439  92.685  95.189  96.988
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.087   0.066   0.037   0.001
## % of var.              1.277   0.788   0.604   0.336   0.007
## Cumulative % of var.  98.266  99.053  99.657  99.993 100.000
## 
## Individuals (the 100 first)
##                     Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1               |  1.887 |  1.567  0.089  0.690 |  0.447  0.027  0.056 |
## 2               |  1.961 |  1.636  0.097  0.696 |  0.340  0.016  0.030 |
## 3               |  2.535 |  1.907  0.132  0.565 |  0.411  0.023  0.026 |
## 4               |  4.457 |  1.590  0.092  0.127 | -3.863  2.046  0.751 |
## 5               |  2.810 | -2.655  0.256  0.892 | -0.654  0.059  0.054 |
## 6               |  0.865 |  0.441  0.007  0.260 | -0.081  0.001  0.009 |
## 7               |  1.766 | -1.538  0.086  0.759 | -0.028  0.000  0.000 |
## 8               |  2.248 | -2.036  0.151  0.820 |  0.062  0.001  0.001 |
## 9               |  0.803 | -0.410  0.006  0.261 | -0.471  0.030  0.344 |
## 10              |  1.006 |  0.151  0.001  0.023 | -0.802  0.088  0.636 |
## 11              |  0.991 |  0.023  0.000  0.001 | -0.569  0.044  0.329 |
## 12              |  1.027 | -0.023  0.000  0.000 | -0.526  0.038  0.262 |
## 13              |  1.362 |  0.522  0.010  0.147 | -0.847  0.098  0.387 |
## 14              |  2.109 | -1.534  0.086  0.529 | -0.232  0.007  0.012 |
## 15              |  1.347 |  0.906  0.030  0.452 | -0.253  0.009  0.035 |
## 16              |  0.680 |  0.347  0.004  0.261 |  0.114  0.002  0.028 |
## 17              |  1.246 |  0.914  0.030  0.539 |  0.018  0.000  0.000 |
## 18              |  1.097 |  0.809  0.024  0.544 |  0.038  0.000  0.001 |
## 19              |  2.810 |  2.500  0.227  0.792 | -0.794  0.086  0.080 |
## 20              |  4.739 |  4.383  0.699  0.856 | -0.393  0.021  0.007 |
## 21              |  5.893 |  4.914  0.878  0.696 | -2.492  0.852  0.179 |
## 22              |  3.316 |  2.358  0.202  0.505 | -1.782  0.436  0.289 |
## 23              |  3.396 |  2.391  0.208  0.496 | -1.763  0.426  0.269 |
## 24              |  3.290 | -1.964  0.140  0.356 | -1.361  0.254  0.171 |
## 25              |  3.258 | -1.783  0.116  0.299 | -1.610  0.355  0.244 |
## 26              |  2.713 | -0.564  0.012  0.043 | -1.844  0.466  0.462 |
## 27              |  1.332 | -0.963  0.034  0.522 | -0.326  0.015  0.060 |
## 28              |  1.266 | -0.429  0.007  0.115 | -0.588  0.047  0.216 |
## 29              |  1.389 | -1.033  0.039  0.553 | -0.242  0.008  0.030 |
## 30              |  1.280 | -0.703  0.018  0.302 | -0.273  0.010  0.046 |
## 31              |  1.280 | -0.515  0.010  0.162 | -0.367  0.019  0.082 |
## 32              |  1.192 | -0.471  0.008  0.156 | -0.794  0.086  0.443 |
## 33              |  1.270 |  0.069  0.000  0.003 | -1.037  0.147  0.666 |
## 34              |  1.172 | -0.511  0.010  0.190 | -0.722  0.071  0.380 |
## 35              |  1.082 | -0.316  0.004  0.085 | -0.760  0.079  0.494 |
## 36              |  1.526 |  0.394  0.006  0.067 |  0.505  0.035  0.109 |
## 37              |  1.556 |  0.739  0.020  0.225 |  0.118  0.002  0.006 |
## 38              |  3.252 |  2.888  0.303  0.789 | -0.787  0.085  0.059 |
## 39              |  4.384 |  3.998  0.581  0.832 | -0.692  0.066  0.025 |
## 40              |  4.985 |  4.452  0.721  0.797 | -0.381  0.020  0.006 |
## 41              |  2.759 |  1.612  0.095  0.342 | -1.893  0.492  0.471 |
## 42              |  2.368 |  1.159  0.049  0.239 | -1.577  0.341  0.444 |
## 43              |  1.558 |  1.041  0.039  0.446 |  0.039  0.000  0.001 |
## 44              |  3.747 |  3.470  0.438  0.857 | -0.723  0.072  0.037 |
## 45              |  2.544 | -1.513  0.083  0.354 | -1.472  0.297  0.335 |
## 46              |  2.488 | -1.046  0.040  0.177 | -1.958  0.526  0.619 |
## 47              |  1.478 | -0.232  0.002  0.025 |  1.218  0.203  0.679 |
## 48              |  1.797 |  0.713  0.019  0.158 |  1.372  0.258  0.583 |
## 49              |  1.733 |  0.880  0.028  0.258 |  1.101  0.166  0.404 |
## 50              |  2.378 |  1.352  0.066  0.323 |  1.424  0.278  0.359 |
## 51              |  2.509 |  1.884  0.129  0.564 |  1.060  0.154  0.178 |
## 52              |  2.969 |  2.683  0.262  0.817 |  0.712  0.070  0.058 |
## 53              |  1.399 |  0.803  0.023  0.329 |  0.731  0.073  0.273 |
## 54              |  1.464 |  0.265  0.003  0.033 |  1.051  0.152  0.515 |
## 55              |  1.501 |  0.672  0.016  0.200 |  1.153  0.182  0.590 |
## 56              |  1.354 |  0.969  0.034  0.513 |  0.436  0.026  0.104 |
## 57              |  3.348 |  2.993  0.326  0.799 |  0.565  0.044  0.028 |
## 58              |  3.586 |  3.273  0.390  0.833 |  0.229  0.007  0.004 |
## 59              |  5.389 |  5.174  0.974  0.922 |  0.728  0.073  0.018 |
## 60              |  3.174 |  2.921  0.310  0.847 |  0.180  0.004  0.003 |
## 61              |  3.627 |  3.504  0.447  0.933 | -0.049  0.000  0.000 |
## 62              |  4.168 |  2.908  0.308  0.487 | -2.810  1.083  0.454 |
## 63              |  3.640 |  2.434  0.216  0.447 |  1.676  0.385  0.212 |
## 64              |  4.568 | -4.533  0.747  0.985 | -0.290  0.012  0.004 |
## 65              |  4.977 | -4.791  0.835  0.927 | -0.771  0.082  0.024 |
## 66              |  3.415 | -3.208  0.374  0.882 |  0.614  0.052  0.032 |
## 67              |  3.450 | -3.262  0.387  0.894 |  0.526  0.038  0.023 |
## 68              |  3.367 | -3.159  0.363  0.881 |  0.528  0.038  0.025 |
## 69              |  3.799 |  2.462  0.220  0.420 | -1.243  0.212  0.107 |
## 70              |  3.925 |  2.646  0.255  0.455 | -1.570  0.338  0.160 |
## 71              |  1.893 | -0.056  0.000  0.001 |  1.519  0.316  0.644 |
## 72              |  1.718 |  0.428  0.007  0.062 |  1.291  0.229  0.565 |
## 73              |  1.675 |  0.979  0.035  0.341 |  1.007  0.139  0.361 |
## 74              |  2.583 | -2.264  0.186  0.768 |  0.882  0.107  0.117 |
## 75              |  1.320 | -0.689  0.017  0.272 |  0.523  0.038  0.157 |
## 76              |  1.409 | -0.764  0.021  0.293 |  0.381  0.020  0.073 |
## 77              |  1.355 | -0.336  0.004  0.061 |  0.826  0.094  0.371 |
## 78              |  1.826 | -0.157  0.001  0.007 |  1.431  0.281  0.614 |
## 79              |  1.638 |  0.575  0.012  0.123 |  1.251  0.215  0.584 |
## 80              |  6.246 |  5.594  1.138  0.802 |  2.600  0.928  0.173 |
## 81              |  4.990 |  4.655  0.788  0.870 |  1.178  0.190  0.056 |
## 82              |  3.105 | -1.828  0.122  0.347 | -1.000  0.137  0.104 |
## 83              |  2.763 |  2.318  0.195  0.704 |  1.002  0.138  0.131 |
## 84              |  1.087 |  0.353  0.005  0.106 |  0.867  0.103  0.637 |
## 85              |  2.413 | -1.829  0.122  0.575 | -0.819  0.092  0.115 |
## 86              |  1.723 |  1.145  0.048  0.442 |  1.002  0.138  0.338 |
## 87              |  1.737 |  1.291  0.061  0.553 |  0.835  0.096  0.231 |
## 88              |  2.267 |  0.348  0.004  0.024 |  1.766  0.428  0.607 |
## 89              |  2.202 |  1.100  0.044  0.249 |  1.526  0.319  0.480 |
## 90              |  2.782 | -0.841  0.026  0.091 | -1.816  0.452  0.426 |
## 91              |  3.461 |  2.548  0.236  0.542 |  1.870  0.480  0.292 |
## 92              |  2.815 | -2.584  0.243  0.842 | -0.170  0.004  0.004 |
## 93              |  2.716 | -2.475  0.223  0.830 | -0.361  0.018  0.018 |
## 94              |  2.144 | -1.609  0.094  0.563 |  1.147  0.180  0.286 |
## 95              |  2.941 | -1.989  0.144  0.457 |  0.219  0.007  0.006 |
## 96              |  1.129 | -0.406  0.006  0.130 |  0.121  0.002  0.011 |
## 97              |  1.166 | -0.565  0.012  0.235 |  0.714  0.070  0.375 |
## 98              |  3.429 |  2.500  0.227  0.531 |  1.845  0.467  0.290 |
## 99              |  3.266 |  1.643  0.098  0.253 |  2.437  0.815  0.557 |
## 100             |  4.624 |  4.152  0.627  0.806 |  1.541  0.326  0.111 |
##                  Dim.3    ctr   cos2  
## 1                0.287  0.025  0.023 |
## 2                0.346  0.036  0.031 |
## 3               -0.553  0.093  0.048 |
## 4                0.357  0.039  0.006 |
## 5               -0.173  0.009  0.004 |
## 6                0.190  0.011  0.048 |
## 7               -0.098  0.003  0.003 |
## 8               -0.095  0.003  0.002 |
## 9               -0.278  0.024  0.120 |
## 10              -0.103  0.003  0.011 |
## 11              -0.757  0.174  0.583 |
## 12              -0.774  0.182  0.567 |
## 13              -0.504  0.077  0.137 |
## 14               0.161  0.008  0.006 |
## 15              -0.004  0.000  0.000 |
## 16               0.275  0.023  0.163 |
## 17              -0.106  0.003  0.007 |
## 18              -0.112  0.004  0.010 |
## 19               0.194  0.011  0.005 |
## 20               1.544  0.725  0.106 |
## 21               0.976  0.289  0.027 |
## 22              -0.864  0.227  0.068 |
## 23              -1.081  0.355  0.101 |
## 24              -1.109  0.374  0.114 |
## 25              -0.950  0.275  0.085 |
## 26              -0.526  0.084  0.038 |
## 27              -0.181  0.010  0.018 |
## 28              -0.366  0.041  0.083 |
## 29              -0.270  0.022  0.038 |
## 30              -0.578  0.101  0.204 |
## 31              -0.638  0.124  0.248 |
## 32               0.115  0.004  0.009 |
## 33              -0.036  0.000  0.001 |
## 34               0.076  0.002  0.004 |
## 35              -0.016  0.000  0.000 |
## 36               0.545  0.090  0.127 |
## 37               1.050  0.336  0.456 |
## 38               0.911  0.252  0.078 |
## 39               1.575  0.754  0.129 |
## 40               2.008  1.227  0.162 |
## 41              -0.388  0.046  0.020 |
## 42              -0.442  0.059  0.035 |
## 43              -0.834  0.212  0.287 |
## 44              -0.221  0.015  0.003 |
## 45              -0.892  0.242  0.123 |
## 46              -0.440  0.059  0.031 |
## 47               0.342  0.036  0.054 |
## 48               0.648  0.128  0.130 |
## 49               0.795  0.192  0.210 |
## 50               1.129  0.388  0.226 |
## 51               0.971  0.287  0.150 |
## 52              -0.609  0.113  0.042 |
## 53               0.188  0.011  0.018 |
## 54               0.482  0.071  0.108 |
## 55              -0.227  0.016  0.023 |
## 56              -0.140  0.006  0.011 |
## 57               1.071  0.349  0.102 |
## 58               1.224  0.455  0.116 |
## 59              -0.202  0.012  0.001 |
## 60               0.892  0.242  0.079 |
## 61               0.080  0.002  0.000 |
## 62               0.635  0.123  0.023 |
## 63              -1.598  0.777  0.193 |
## 64              -0.113  0.004  0.001 |
## 65              -0.451  0.062  0.008 |
## 66               0.828  0.209  0.059 |
## 67               0.798  0.194  0.053 |
## 68               0.875  0.233  0.068 |
## 69               0.089  0.002  0.001 |
## 70               0.267  0.022  0.005 |
## 71               0.856  0.223  0.205 |
## 72               0.661  0.133  0.148 |
## 73               0.346  0.036  0.043 |
## 74               0.564  0.097  0.048 |
## 75               0.180  0.010  0.019 |
## 76               0.574  0.100  0.166 |
## 77               0.428  0.056  0.100 |
## 78               0.822  0.205  0.203 |
## 79               0.179  0.010  0.012 |
## 80               0.622  0.117  0.010 |
## 81              -0.524  0.084  0.011 |
## 82              -2.017  1.237  0.422 |
## 83              -0.702  0.150  0.065 |
## 84              -0.215  0.014  0.039 |
## 85              -0.901  0.247  0.139 |
## 86               0.344  0.036  0.040 |
## 87               0.427  0.055  0.060 |
## 88               0.896  0.244  0.156 |
## 89               0.611  0.114  0.077 |
## 90              -1.587  0.766  0.325 |
## 91              -0.059  0.001  0.000 |
## 92              -0.852  0.221  0.092 |
## 93              -0.748  0.170  0.076 |
## 94               0.135  0.006  0.004 |
## 95               0.087  0.002  0.001 |
## 96               0.215  0.014  0.036 |
## 97               0.062  0.001  0.003 |
## 98               0.635  0.123  0.034 |
## 99               0.543  0.090  0.028 |
## 100             -0.340  0.035  0.005 |
## 
## Variables
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## Retail.Price    |  0.703  6.956  0.494 | -0.643 21.950  0.414 |  0.235
## Dealer.Cost     |  0.699  6.881  0.489 | -0.645 22.104  0.416 |  0.237
## Engine.Size..l. |  0.925 12.046  0.856 |  0.021  0.024  0.000 |  0.044
## Cyl             |  0.891 11.168  0.793 | -0.107  0.609  0.011 |  0.075
## HP              |  0.849 10.151  0.721 | -0.401  8.539  0.161 |  0.070
## City.MPG        | -0.828  9.640  0.685 |  0.005  0.001  0.000 |  0.493
## Hwy.MPG         | -0.817  9.400  0.668 |  0.015  0.012  0.000 |  0.552
## Weight          |  0.896 11.312  0.804 |  0.230  2.804  0.053 | -0.103
## Wheel.Base      |  0.710  7.087  0.503 |  0.574 17.487  0.329 |  0.244
## Len             |  0.684  6.594  0.468 |  0.561 16.680  0.314 |  0.318
## Width           |  0.789  8.765  0.623 |  0.429  9.790  0.184 |  0.081
##                    ctr   cos2  
## Retail.Price     6.501  0.055 |
## Dealer.Cost      6.618  0.056 |
## Engine.Size..l.  0.223  0.002 |
## Cyl              0.663  0.006 |
## HP               0.583  0.005 |
## City.MPG        28.629  0.243 |
## Hwy.MPG         35.880  0.305 |
## Weight           1.259  0.011 |
## Wheel.Base       6.994  0.059 |
## Len             11.882  0.101 |
## Width            0.767  0.007 |
# Get the percentage of variance explained by the first 3 new dimensions
pca_output_ten_v$eig[,2][1:3]
##    comp 1    comp 2    comp 3 
## 64.587622 17.126589  7.724803
# Get the cumulative percentage of variance
pca_output_ten_v$eig[,3][1:3]
##   comp 1   comp 2   comp 3 
## 64.58762 81.71421 89.43901
# Run a PCA with active and supplementary variables
pca_output_all <- FactoMineR::PCA(cars, quanti.sup = 1:8, quali.sup = 20:21, graph = F)

# Get the most correlated variables
FactoMineR::dimdesc(pca_output_all, axes = 1:2)
## $Dim.1
## $Dim.1$quanti
##                                   correlation       p.value
## <NA>                                       NA            NA
## Engine.Size..l.                     0.9251267 5.116179e-164
## Weight                              0.8964700 3.638599e-138
## Cyl                                 0.8907643 6.262131e-134
## HP                                  0.8492193 8.059693e-109
## Width                               0.7891195  1.664116e-83
## Wheel.Base                          0.7095703  1.671006e-60
## Retail.Price                        0.7030143  5.914584e-59
## Dealer.Cost                         0.6991979  4.510056e-58
## Len                                 0.6844621  8.580649e-55
## SUV                                 0.3149124  2.343396e-10
## RWD                                 0.3064581  7.365567e-10
## AWD                                 0.2192594  1.346164e-05
## Minivan                             0.1435461  4.663306e-03
## Small.Sporty..Compact.Large.Sedan  -0.3062320  7.590854e-10
## Hwy.MPG                            -0.8171975  3.651747e-94
## City.MPG                           -0.8275744  1.404114e-98
## 
## $Dim.1$quali
##             R2      p.value
## type 0.1505098 8.763997e-13
## 
## $Dim.1$category
##                                    Estimate      p.value
## SUV                                1.414274 2.343396e-10
## Minivan                            1.074161 4.663306e-03
## Small.Sporty..Compact.Large.Sedan -1.224868 7.590854e-10
## 
## 
## $Dim.2
## $Dim.2$quanti
##                                   correlation      p.value
## <NA>                                       NA           NA
## Wheel.Base                          0.5739738 2.730836e-35
## Len                                 0.5605697 2.095168e-33
## Width                               0.4294626 8.445613e-19
## Minivan                             0.3206132 1.060475e-10
## Weight                              0.2298540 4.913548e-06
## SUV                                 0.1675328 9.379368e-04
## Small.Sporty..Compact.Large.Sedan   0.1164972 2.189605e-02
## Cyl                                -0.1071037 3.518485e-02
## RWD                                -0.3862412 3.231605e-15
## HP                                 -0.4010809 2.175520e-16
## Sports.Car                         -0.5919253 5.917164e-38
## Retail.Price                       -0.6430569 1.540011e-46
## Dealer.Cost                        -0.6453051 5.917390e-47
## 
## $Dim.2$quali
##                   R2      p.value
## type      0.43630221 2.376020e-46
## wheeltype 0.01220005 2.981756e-02
## 
## $Dim.2$category
##                                      Estimate      p.value
## Minivan                            1.81590533 1.060475e-10
## SUV                                0.47299804 9.379368e-04
## Small.Sporty..Compact.Large.Sedan  0.06011519 2.189605e-02
## AWD                                0.15182838 2.981756e-02
## RWD                               -0.15182838 2.981756e-02
## Sports.Car                        -2.30896125 5.917164e-38
# Run a PCA treating the first 100 cars as supplementary individuals
pca_output_hundred <- FactoMineR::PCA(cars, quanti.sup = 1:8, quali.sup = 20:21, 
                                      ind.sup = 1:100, graph = F
                                      )

# Inspect the variable contributions in pca_output_hundred
pca_output_hundred$var$contrib
##                     Dim.1        Dim.2      Dim.3       Dim.4        Dim.5
## Retail.Price     6.699758 2.301732e+01  5.6092132  6.73645561 6.685871e-01
## Dealer.Cost      6.636507 2.316262e+01  5.6749619  7.12398875 5.247519e-01
## Engine.Size..l. 12.207533 3.398443e-05  0.1690355 25.15549588 1.161249e-01
## Cyl             11.092661 5.960251e-01  0.6920445 44.70555192 3.208953e+00
## HP              10.006863 8.872771e+00  0.3887040  0.05144773 3.440815e+00
## City.MPG         9.535081 2.019316e-02 30.3660045  4.74002551 1.019129e+01
## Hwy.MPG          9.456488 8.985550e-03 35.8045505  1.36652382 1.798497e-04
## Weight          11.373558 2.643370e+00  0.7304973  0.10317032 2.668755e+01
## Wheel.Base       7.031113 1.701242e+01  7.7390269  4.84277075 2.178306e+00
## Len              7.107357 1.448978e+01 11.4900264  4.36687777 2.383013e+01
## Width            8.853081 1.017648e+01  1.3359353  0.80769192 2.915331e+01
# Run a PCA using the 11 non-binary numeric variables
cars_pca <- ade4::dudi.pca(cars[,9:19], scannf = FALSE, nf = 4)

# Explore the summary of cars_pca
summary(cars_pca)
## Class: pca dudi
## Call: ade4::dudi.pca(df = cars[, 9:19], scannf = FALSE, nf = 4)
## 
## Total inertia: 11
## 
## Eigenvalues:
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  7.1046  1.8839  0.8497  0.3570  0.2754 
## 
## Projected inertia (%):
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  64.588  17.127   7.725   3.246   2.504 
## 
## Cumulative projected inertia (%):
##     Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
##   64.59   81.71   89.44   92.68   95.19 
## 
## (Only 5 dimensions (out of 11) are shown)
# Explore the summary of pca_output_ten_v
summary(pca_output_ten_v)
## 
## Call:
## FactoMineR::PCA(X = cars[, 9:19], ncp = 4, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               7.105   1.884   0.850   0.357   0.275   0.198
## % of var.             64.588  17.127   7.725   3.246   2.504   1.799
## Cumulative % of var.  64.588  81.714  89.439  92.685  95.189  96.988
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.087   0.066   0.037   0.001
## % of var.              1.277   0.788   0.604   0.336   0.007
## Cumulative % of var.  98.266  99.053  99.657  99.993 100.000
## 
## Individuals (the 10 first)
##                     Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1               |  1.887 |  1.567  0.089  0.690 |  0.447  0.027  0.056 |
## 2               |  1.961 |  1.636  0.097  0.696 |  0.340  0.016  0.030 |
## 3               |  2.535 |  1.907  0.132  0.565 |  0.411  0.023  0.026 |
## 4               |  4.457 |  1.590  0.092  0.127 | -3.863  2.046  0.751 |
## 5               |  2.810 | -2.655  0.256  0.892 | -0.654  0.059  0.054 |
## 6               |  0.865 |  0.441  0.007  0.260 | -0.081  0.001  0.009 |
## 7               |  1.766 | -1.538  0.086  0.759 | -0.028  0.000  0.000 |
## 8               |  2.248 | -2.036  0.151  0.820 |  0.062  0.001  0.001 |
## 9               |  0.803 | -0.410  0.006  0.261 | -0.471  0.030  0.344 |
## 10              |  1.006 |  0.151  0.001  0.023 | -0.802  0.088  0.636 |
##                  Dim.3    ctr   cos2  
## 1                0.287  0.025  0.023 |
## 2                0.346  0.036  0.031 |
## 3               -0.553  0.093  0.048 |
## 4                0.357  0.039  0.006 |
## 5               -0.173  0.009  0.004 |
## 6                0.190  0.011  0.048 |
## 7               -0.098  0.003  0.003 |
## 8               -0.095  0.003  0.002 |
## 9               -0.278  0.024  0.120 |
## 10              -0.103  0.003  0.011 |
## 
## Variables (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## Retail.Price    |  0.703  6.956  0.494 | -0.643 21.950  0.414 |  0.235
## Dealer.Cost     |  0.699  6.881  0.489 | -0.645 22.104  0.416 |  0.237
## Engine.Size..l. |  0.925 12.046  0.856 |  0.021  0.024  0.000 |  0.044
## Cyl             |  0.891 11.168  0.793 | -0.107  0.609  0.011 |  0.075
## HP              |  0.849 10.151  0.721 | -0.401  8.539  0.161 |  0.070
## City.MPG        | -0.828  9.640  0.685 |  0.005  0.001  0.000 |  0.493
## Hwy.MPG         | -0.817  9.400  0.668 |  0.015  0.012  0.000 |  0.552
## Weight          |  0.896 11.312  0.804 |  0.230  2.804  0.053 | -0.103
## Wheel.Base      |  0.710  7.087  0.503 |  0.574 17.487  0.329 |  0.244
## Len             |  0.684  6.594  0.468 |  0.561 16.680  0.314 |  0.318
##                    ctr   cos2  
## Retail.Price     6.501  0.055 |
## Dealer.Cost      6.618  0.056 |
## Engine.Size..l.  0.223  0.002 |
## Cyl              0.663  0.006 |
## HP               0.583  0.005 |
## City.MPG        28.629  0.243 |
## Hwy.MPG         35.880  0.305 |
## Weight           1.259  0.011 |
## Wheel.Base       6.994  0.059 |
## Len             11.882  0.101 |
# Create a factor map for the variables
factoextra::fviz_pca_var(pca_output_all, select.var = list(cos2 = 0.7), repel = TRUE)

# Modify the code to create a factor map for the individuals
factoextra::fviz_pca_ind(pca_output_all, select.ind = list(cos2 = 0.7), repel = TRUE)

# Create a barplot for the variables with the highest cos2 in the 1st PC
factoextra::fviz_cos2(pca_output_all, choice = "var", axes = 1, top = 10)

# Create a barplot for the variables with the highest cos2 in the 2nd PC
factoextra::fviz_cos2(pca_output_all, choice = "var", axes = 2, top = 10)

# Create a factor map for the top 5 variables with the highest contributions
factoextra::fviz_pca_var(pca_output_all, select.var = list(contrib = 5), repel = TRUE)

# Create a factor map for the top 5 individuals with the highest contributions
factoextra::fviz_pca_ind(pca_output_all, select.ind = list(contrib = 5), repel = TRUE)

# Create a barplot for the variables with the highest contributions to the 1st PC
factoextra::fviz_contrib(pca_output_all, choice = "var", axes = 1, top = 5)

# Create a barplot for the variables with the highest contributions to the 2nd PC
factoextra::fviz_contrib(pca_output_all, choice = "var", axes = 2, top = 5)

# Create a biplot with no labels for the individuals, using the geom argument.
factoextra::fviz_pca_biplot(pca_output_all, geom.ind = "point")

# Create ellipsoids for the wheeltype column.
factoextra::fviz_pca_ind(pca_output_all, habillage = cars$wheeltype, addEllipses = TRUE)

# Create the biplot with ellipsoids
factoextra::fviz_pca_biplot(pca_output_all, habillage=cars$wheeltype, addEllipses=TRUE, alpha.var="cos2")


Chapter 2 - Advanced PCA and Non-Negative Matrix Factorization (NNMF)

Determining the right number of PCs:

  • Desire to have stopping rules to determine the proper number of principal components
  • The screeplot can help to find an elbow, which is often a good stopping point
    • mtcars_pca <- PCA(mtcars)
    • fviz_screeplot(mtcars_pca, ncp=5)
  • The Kaiser-Guttman rule recommends keeping components with eigenvalues > 1
    • mtcars_pca$eig
    • get_eigenvalue(mtcars_pca)
  • Parallel analysis is superior to the previous tests - compares variance explained by PC to what would be expected with random data - select “above the line”
    • library(paran)
    • mtcars_pca_ret <- paran(mtcars, graph = TRUE) # paran() takes the raw data, not a PCA object
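
The Kaiser-Guttman rule is just a threshold on the eigenvalues of the correlation matrix, so it can be sanity-checked in base R with no extra packages (a minimal sketch using the built-in mtcars data):

```r
# Kaiser-Guttman by hand: eigenvalues of the correlation matrix;
# retain components whose eigenvalue exceeds 1.
eig <- eigen(cor(mtcars))$values
round(eig, 3)
sum(eig > 1)  # number of components the rule retains
```

Note that the eigenvalues sum to the number of variables, so "eigenvalue > 1" means "explains more than an average variable's share of variance".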

Performing PCA on datasets with missing values:

  • Missing values can cause problems for PCA
    • Skipping rows with missing values can lead to too little data while also biasing the results
    • Mean imputation is quick and dirty, but distorts the distribution of the variable
    • A preferred approach is to estimate each NA as a linear combination of the other parameters
  • The missMDA library can help with these estimations
    • library(missMDA)
    • nPCs <- estim_ncpPCA(as.matrix(sleep))
    • completed_sleep <- imputePCA(sleep, ncp = nPCs$ncp, scale = TRUE)
    • PCA(completed_sleep$completeObs)
  • Can also impute missing values in the pca() function
    • library(pcaMethods)
    • sleep_pca_methods <- pca(sleep, nPcs=2, method="ppca", center = TRUE)
    • imp_sleep_pcamethods <- completeObs(sleep_pca_methods)
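
The idea behind imputePCA() can be sketched in base R as an iterate-and-refit loop (impute_pca() below is a hypothetical helper, not the missMDA implementation): start from mean imputation, fit a rank-k SVD reconstruction, refill only the missing cells, and repeat:

```r
# Iterative-PCA imputation sketch (base R, illustrative only).
impute_pca <- function(X, k = 2, iters = 50) {
  miss <- is.na(X)
  # start from column-mean imputation
  Xi <- apply(X, 2, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  for (i in seq_len(iters)) {
    ctr <- colMeans(Xi)
    sdv <- apply(Xi, 2, sd)
    Z   <- scale(Xi, center = ctr, scale = sdv)
    s   <- svd(Z)
    # rank-k reconstruction, mapped back onto the original scale
    Zk  <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*%
           t(s$v[, 1:k, drop = FALSE])
    fit <- sweep(sweep(Zk, 2, sdv, `*`), 2, ctr, `+`)
    Xi[miss] <- fit[miss]  # refill only the originally missing cells
  }
  Xi
}

completed <- impute_pca(as.matrix(airquality), k = 2)
sum(is.na(completed))  # all cells are now filled
```

Unlike mean imputation, the refilled values respect the correlations captured by the leading components, which is why this family of methods distorts the data less.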

NNMF and Topic Detection with nmf():

  • Non-negative matrix factorization is designed for cases where negative values should never be predicted
    • Tries to decompose an m x n matrix into an m x r matrix and an r x n matrix
  • Dimensionality reduction is critical for sparse data (e.g., text by document)
  • The NMF library implements nmf in R
    • library(NMF)
    • bbc_res <- nmf(bbc_tdm, 5) # sets r=5
    • W <- basis(bbc_res) # terms as rows, topic as columns
    • H <- coef(bbc_res) # topics as rows, documents as columns
    • colnames(W) <- c("topic1", "topic2", "topic3", "topic4", "topic5")
    • as.data.frame(W) %>% rownames_to_column('words') %>% arrange(desc(topic1)) %>% column_to_rownames('words')
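
Under the hood, NMF solvers commonly use multiplicative updates; the loop is simple enough to sketch in base R (nmf_sketch() is a toy helper for illustration, not the NMF package's implementation):

```r
# Lee-Seung multiplicative updates for V ~ W %*% H, all entries >= 0.
nmf_sketch <- function(V, r = 2, iters = 200, eps = 1e-9) {
  set.seed(42)
  W <- matrix(runif(nrow(V) * r), nrow(V), r)  # terms-by-topics
  H <- matrix(runif(r * ncol(V)), r, ncol(V))  # topics-by-documents
  for (i in seq_len(iters)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

V   <- matrix(runif(30 * 8), 30, 8)  # stand-in for a small term-document matrix
res <- nmf_sketch(V, r = 3)
all(res$W >= 0) && all(res$H >= 0)   # the factors stay non-negative
```

Because the updates only ever multiply by non-negative ratios, W and H can never go negative, which is what makes the factorization interpretable as additive "topics".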

Example code includes:

data(airquality, package="datasets")
airquality <- airquality[complete.cases(airquality), ]

# Conduct a PCA on the airquality dataset
pca_air <- FactoMineR::PCA(airquality)

# Apply the Kaiser-Guttman rule
summary(pca_air, ncp = 4)
## 
## Call:
## FactoMineR::PCA(X = airquality) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               2.469   1.113   0.998   0.768   0.425   0.227
## % of var.             41.147  18.552  16.640  12.804   7.078   3.778
## Cumulative % of var.  41.147  59.699  76.339  89.144  96.222 100.000
## 
## Individuals (the 10 first)
##             Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1       |  2.659 | -0.566  0.117  0.045 | -1.463  1.732  0.303 | -1.370
## 2       |  2.474 | -0.664  0.161  0.072 | -0.859  0.598  0.121 | -1.365
## 3       |  2.494 | -1.488  0.808  0.356 | -1.029  0.857  0.170 | -1.424
## 4       |  3.110 | -1.454  0.772  0.219 | -2.327  4.383  0.560 | -1.092
## 7       |  2.688 | -0.866  0.273  0.104 | -2.101  3.572  0.611 | -0.710
## 8       |  3.101 | -2.607  2.480  0.707 | -0.780  0.493  0.063 | -0.812
## 9       |  4.323 | -3.780  5.215  0.765 | -0.236  0.045  0.003 | -0.919
## 12      |  2.142 | -1.067  0.415  0.248 | -1.640  2.178  0.586 | -0.248
## 13      |  2.493 | -1.164  0.494  0.218 | -1.867  2.822  0.561 | -0.111
## 14      |  2.271 | -1.283  0.601  0.319 | -1.757  2.498  0.598 | -0.034
##            ctr   cos2    Dim.4    ctr   cos2  
## 1        1.693  0.265 | -1.493  2.615  0.316 |
## 2        1.682  0.305 | -1.752  3.600  0.502 |
## 3        1.830  0.326 | -0.647  0.491  0.067 |
## 4        1.077  0.123 |  0.045  0.002  0.000 |
## 7        0.455  0.070 | -0.446  0.233  0.028 |
## 8        0.595  0.069 | -0.999  1.171  0.104 |
## 9        0.761  0.045 | -0.429  0.216  0.010 |
## 12       0.055  0.013 | -0.392  0.180  0.033 |
## 13       0.011  0.002 | -0.253  0.075  0.010 |
## 14       0.001  0.000 | -0.065  0.005  0.001 |
## 
## Variables
##            Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Ozone   |  0.875 30.997  0.765 | -0.152  2.062  0.023 |  0.149  2.220
## Solar.R |  0.432  7.569  0.187 | -0.693 43.144  0.480 |  0.064  0.408
## Wind    | -0.756 23.171  0.572 | -0.073  0.475  0.005 | -0.058  0.336
## Temp    |  0.877 31.155  0.769 |  0.105  0.997  0.011 | -0.047  0.217
## Month   |  0.405  6.654  0.164 |  0.749 50.438  0.561 | -0.143  2.035
## Day     | -0.106  0.454  0.011 |  0.179  2.884  0.032 |  0.973 94.784
##           cos2    Dim.4    ctr   cos2  
## Ozone    0.022 | -0.166  3.596  0.028 |
## Solar.R  0.004 |  0.521 35.323  0.271 |
## Wind     0.003 |  0.487 30.824  0.237 |
## Temp     0.002 |  0.130  2.189  0.017 |
## Month    0.020 |  0.455 26.940  0.207 |
## Day      0.946 |  0.093  1.128  0.009 |
# Perform the screeplot test
factoextra::fviz_screeplot(pca_air, ncp = 5)

data(airquality, package="datasets")

# Conduct a parallel analysis with paran().
air_paran <- paran::paran(airquality[complete.cases(airquality), ])
## 
## Using eigendecomposition of correlation matrix.
## Computing: 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
## 
## 
## Results of Horn's Parallel Analysis for component retention
## 180 iterations, using the mean estimate
## 
## -------------------------------------------------- 
## Component   Adjusted    Unadjusted    Estimated 
##             Eigenvalue  Eigenvalue    Bias 
## -------------------------------------------------- 
## 1           2.145699    2.468840      0.323141
## -------------------------------------------------- 
## 
## Adjusted eigenvalues > 1 indicate dimensions to retain.
## (1 components retained)
# Check out air_paran's suggested number of PCs to retain.
air_paran$Retained
## [1] 1
# Conduct a parallel analysis.
air_fa_parallel <- psych::fa.parallel(airquality)
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : A loading greater than abs(1) was detected. Examine the loadings
## carefully.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs
## = np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor extraction method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected. Examine the results carefully
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : A loading greater than abs(1) was detected. Examine the loadings
## carefully.
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs
## = np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor extraction method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected. Examine the results carefully
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs
## = np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor extraction method.

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs
## = np.obs, : The estimated weights for the factor scores are probably
## incorrect. Try a different factor extraction method.

## Parallel analysis suggests that the number of factors =  3  and the number of components =  2
# Check out air_fa_parallel's suggested number of PCs to retain.
air_fa_parallel$ncomp
## [1] 2
# Check out the summary of airquality
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Check out the number of cells with missing values.
sum(is.na(airquality))
## [1] 44
# Check out the number of rows with missing values.
nrow(airquality[!complete.cases(airquality), ])
## [1] 42
# Estimate the optimal number of dimensions for imputation.
missMDA::estim_ncpPCA(airquality, ncp.max=5)
## $ncp
## [1] 0
## 
## $criterion
##        0        1        2        3        4        5 
## 1520.506 1823.946 1771.702 2774.323 2888.306 6369.592
bbc_res <- readRDS("./RInputFiles/bbc_res.rds")

# Get a 5-rank approximation of corpus_tdm.
# bbc_res <- NMF::nmf(corpus_tdm, 5)

# Get the term-topic matrix W.
W <- NMF::basis(bbc_res)

# Check out the dimensions of W.
dim(W)
## [1] 3137    5
# Normalize W.
normal <- function(x) { x / sum(x) }
normal_W <- apply(W, 2, FUN=normal)


# Get the topic-text matrix H.
H <- coef(bbc_res)

# Check out the dimensions of H.
dim(H)
## [1]  5 50
# Normalize H.
normal_H <- apply(H, 2, FUN=normal)


# Explore the nmf's algorithms.
alg <- NMF::nmfAlgorithm()

# Choose the algorithms implemented in R.
R_alg <- NMF::nmfAlgorithm(version="R")

# Use the two-version algorithms.
# bbc_double_opt <- NMF::nmf(x=corpus_tdm, rank=5, method=R_alg, .options="v")

Chapter 3 - Exploratory Factor Analysis

Intro to EFA:

  • Variance and covariance are only partially explained by factors
    • Latent constructs drive observed variables by way of loadings (weightings)
    • Model accepts that not all variance can be explained by the latent constructs
  • Example of analyzing a single variable
    • Check for data factorability
    • Extract factors
    • Choose the “right” number of factors to retain
    • Rotate factors
    • Interpret the results
  • Example of running the process on the bfi dataset from psych

Intro to EFA: Data Factorability:

  • Need to check whether the dataset is factorable
  • Bartlett sphericity test - a drawback is that it is almost always significant for large samples
    • H0: There is no significant difference between the correlation matrix and the identity matrix of the same dimensionality
    • H1: There is a significant difference between them and, thus, we have strong evidence that there are underlying factors
    • library(polycor)
    • bfi_s <- bfi[1:200, 1:25]
    • bfi_hetcor <- hetcor(bfi_s)
    • bfi_c <- bfi_hetcor$correlations
    • bfi_factorability <- cortest.bartlett(bfi_c)
  • Kaiser-Meyer-Olkin (KMO) is also known as the measure of sampling adequacy
    • KMO(bfi_c)
    • Should be in the 0.60s to be acceptable
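
The Bartlett statistic is simple enough to compute by hand from a correlation matrix. A minimal base-R sketch (the bartlett_sphericity helper and the choice of mtcars columns are illustrative, not from the course):

```r
# Bartlett's test of sphericity computed directly (base R only).
# H0: the correlation matrix is an identity matrix.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Strongly correlated mtcars columns should reject H0 decisively.
R <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
res <- bartlett_sphericity(R, n = nrow(mtcars))
res$p.value < 0.001
```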

Extraction methods:

  • Can extract factors using fa(), with extraction methods including
    • minres: minimum residual [default] (slightly modified methods: ols, wls, gls)
    • mle: Maximum Likelihood Estimation (MLE)
    • paf: Principal Axes Factor (PAF) extraction
    • minchi: minimum sample size weighted chi square
    • minrank: minimum rank
    • alpha: alpha factoring
  • Example of using the minres extraction method
    • library(psych)
    • library(GPArotation)
    • f_bfi_minres <- fa(bfi_c, nfactors = 3, rotate = "none")
    • f_bfi_minres_common <- sort(f_bfi_minres$communality, decreasing = TRUE)
    • data.frame(f_bfi_minres_common)
    • f_bfi_mle <- fa(bfi_c, nfactors = 3, fm = "mle", rotate = "none")
    • f_bfi_mle_common <- sort(f_bfi_mle$communality, decreasing = TRUE)
    • data.frame(f_bfi_mle_common)

Choosing the right number of factors:

  • Arriving at the right number of factors is not always an easy task
    • fa.parallel(bfi_c, n.obs = 200, fa = "fa", fm = "minres")
    • fa.parallel(bfi_c, n.obs = 200, fa = "fa", fm = "mle")

Example code includes:

hsq <- readr::read_delim("./RInputFiles/humor_dataset.csv", delim=";")
## Parsed with column specification:
## cols(
##   .default = col_double()
## )
## See spec(...) for full column specifications.
str(hsq, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1071 obs. of  39 variables:
##  $ Q1           : num  2 2 3 3 1 3 4 2 2 4 ...
##  $ Q2           : num  2 3 4 3 4 3 1 4 2 2 ...
##  $ Q3           : num  3 2 3 3 2 3 2 4 1 4 ...
##  $ Q4           : num  1 2 3 4 2 2 4 1 1 1 ...
##  $ Q5           : num  4 4 4 3 3 3 2 5 3 3 ...
##  $ Q6           : num  5 4 4 5 5 3 3 5 4 5 ...
##  $ Q7           : num  4 4 3 4 4 4 3 4 3 4 ...
##  $ Q8           : num  3 3 1 3 1 2 3 3 1 2 ...
##  $ Q9           : num  4 4 2 -1 4 2 4 2 3 3 ...
##  $ Q10          : num  3 3 4 4 4 1 4 4 3 1 ...
##  $ Q11          : num  3 4 3 2 2 3 4 3 2 5 ...
##  $ Q12          : num  1 3 2 4 2 3 1 3 2 3 ...
##  $ Q13          : num  5 3 4 4 5 4 2 5 5 1 ...
##  $ Q14          : num  4 4 4 5 4 4 1 4 3 3 ...
##  $ Q15          : num  4 5 3 4 4 4 2 3 3 1 ...
##  $ Q16          : num  4 4 3 3 4 3 4 3 4 5 ...
##  $ Q17          : num  2 2 2 3 2 2 4 3 2 5 ...
##  $ Q18          : num  3 2 4 3 3 1 1 4 2 1 ...
##  $ Q19          : num  3 3 2 3 2 4 3 5 4 3 ...
##  $ Q20          : num  1 2 1 3 1 2 1 3 1 1 ...
##  $ Q21          : num  4 3 4 4 5 4 3 4 4 2 ...
##  $ Q22          : num  4 3 2 3 3 4 2 3 4 1 ...
##  $ Q23          : num  3 4 4 2 3 4 2 3 4 5 ...
##  $ Q24          : num  2 2 3 4 1 2 3 1 2 2 ...
##  $ Q25          : num  1 2 2 2 1 2 4 1 1 4 ...
##  $ Q26          : num  3 5 4 4 5 3 3 4 3 5 ...
##  $ Q27          : num  2 1 3 2 2 2 2 2 4 5 ...
##  $ Q28          : num  4 2 3 2 3 4 2 4 4 2 ...
##  $ Q29          : num  2 4 2 4 2 3 3 1 1 1 ...
##  $ Q30          : num  4 4 5 5 5 4 3 5 5 5 ...
##  $ Q31          : num  2 3 4 3 4 3 4 2 2 3 ...
##  $ Q32          : num  2 1 2 3 2 3 4 2 1 2 ...
##  $ affiliative  : num  4 3.3 3.9 3.6 4.1 3.6 2.3 4.4 4.1 2.4 ...
##  $ selfenhancing: num  3.5 3.5 3.9 4 4.1 2.9 2.3 4.1 3.3 2.9 ...
##  $ agressive    : num  3 3.3 3.1 2.9 2.9 3.4 2.8 3.3 2.9 3.8 ...
##  $ selfdefeating: num  2.3 2.4 2.3 3.3 2 2.6 2.8 2.5 2 2.3 ...
##  $ age          : num  25 44 50 30 52 30 27 34 30 18 ...
##  $ gender       : num  2 2 1 2 1 2 1 1 2 1 ...
##  $ accuracy     : num  100 90 75 85 80 60 60 88 95 85 ...
# Check out the dimensionality of hsq.
dim(hsq)
## [1] 1071   39
# Explore the correlation object hsq_correl.
hsq_correl <- psych::mixedCor(hsq, c=NULL, p=1:32)
## Warning in matpLower(x, nvar, gminx, gmaxx, gminy, gmaxy): 496 cells were
## adjusted for 0 values using the correction for continuity. Examine your
## data carefully.
str(hsq_correl)
## List of 6
##  $ rho  : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ rx   : NULL
##  $ poly :List of 4
##   ..$ rho  : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   ..$ tau  : num [1:32, 1:6] -2.77 -2.77 -2.9 -3.11 -2.9 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. .. ..$ : chr [1:6] "1" "2" "3" "4" ...
##   ..$ n.obs: int 1071
##   ..$ Call : language polychoric(x = data[, p], smooth = smooth, global = global, weight = weight,      correct = correct)
##   ..- attr(*, "class")= chr [1:2] "psych" "poly"
##  $ tetra:List of 2
##   ..$ rho: NULL
##   ..$ tau: NULL
##  $ rpd  : NULL
##  $ Call : language psych::mixedCor(data = hsq, c = NULL, p = 1:32)
##  - attr(*, "class")= chr [1:2] "psych" "mixed"
# Getting the correlation matrix of the dataset.
hsq_polychoric <- hsq_correl$poly$rho

# Explore the correlation structure of the dataset.
ggcorrplot::ggcorrplot(hsq_polychoric)

# Apply the Bartlett test on the correlation matrix.
psych::cortest.bartlett(hsq_polychoric)
## Warning in psych::cortest.bartlett(hsq_polychoric): n not specified, 100
## used
## $chisq
## [1] 1114.409
## 
## $p.value
## [1] 1.610583e-49
## 
## $df
## [1] 496
# Check the KMO index.
psych::KMO(hsq_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = hsq_polychoric)
## Overall MSA =  0.87
## MSA for each item = 
##   Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15 
## 0.94 0.93 0.91 0.90 0.91 0.88 0.82 0.86 0.95 0.86 0.78 0.90 0.85 0.93 0.82 
##  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24  Q25  Q26  Q27  Q28  Q29  Q30 
## 0.85 0.87 0.83 0.89 0.83 0.87 0.84 0.81 0.84 0.83 0.89 0.83 0.93 0.87 0.81 
##  Q31  Q32 
## 0.81 0.91
# EFA with four factors. 
f_hsq <- psych::fa(hsq_polychoric, nfactors=4)

# Inspect the resulting EFA object.
str(f_hsq)
## List of 44
##  $ residual     : num [1:32, 1:32] 0.5202 0.01625 0.02522 -0.00162 0.00493 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ dof          : num 374
##  $ ENull        : num NA
##  $ chi          : num NA
##  $ rms          : num 0.0411
##  $ nh           : logi NA
##  $ EPVAL        : num NA
##  $ crms         : num 0.0473
##  $ EBIC         : num NA
##  $ ESABIC       : num NA
##  $ fit          : num 0.849
##  $ fit.off      : num 0.969
##  $ sd           : num 0.0397
##  $ factors      : num 4
##  $ complexity   : Named num [1:32] 1.02 1.05 1.29 1 1.08 ...
##   ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ n.obs        : logi NA
##  $ PVAL         : logi NA
##  $ objective    : num 2.19
##  $ criteria     : Named num [1:3] 2.19 NA NA
##   ..- attr(*, "names")= chr [1:3] "objective" "" ""
##  $ Call         : language psych::fa(r = hsq_polychoric, nfactors = 4)
##  $ null.model   : num 12.7
##  $ null.dof     : num 496
##  $ r.scores     : num [1:4, 1:4] 1 -0.166 -0.434 0.212 -0.166 ...
##  $ R2           : num [1:4] 0.886 0.867 0.866 0.821
##  $ valid        : num [1:4] 0.93 0.915 0.902 0.889
##  $ score.cor    : num [1:4, 1:4] 1 -0.198 -0.44 0.229 -0.198 ...
##  $ weights      : num [1:32, 1:4] 0.14289 -0.01175 -0.01633 0.00872 -0.13143 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  $ rotation     : chr "oblimin"
##  $ communality  : Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
##   ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ communalities: Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
##   ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ uniquenesses : Named num [1:32] 0.52 0.592 0.637 0.593 0.567 ...
##   ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ values       : num [1:32] 6.37 2.96 2.4 1.74 0.78 ...
##  $ e.values     : num [1:32] 6.91 3.5 2.96 2.28 1.45 ...
##  $ loadings     : 'loadings' num [1:32, 1:4] 0.6746 -0.0849 -0.1127 0.0182 -0.5985 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  $ model        : num [1:32, 1:32] 0.4798 -0.2256 -0.2024 -0.0929 -0.4516 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ fm           : chr "minres"
##  $ rot.mat      : num [1:4, 1:4] 0.534 -0.503 -0.279 -0.764 0.308 ...
##  $ Phi          : num [1:4, 1:4] 1 -0.142 -0.375 0.18 -0.142 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  $ Structure    : 'loadings' num [1:32, 1:4] 0.6895 -0.3145 -0.2541 -0.0802 -0.6464 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  $ method       : chr "regression"
##  $ R2.scores    : Named num [1:4] 0.886 0.867 0.866 0.821
##   ..- attr(*, "names")= chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  $ r            : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##   .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
##  $ fn           : chr "fa"
##  $ Vaccounted   : num [1:5, 1:4] 3.973 0.124 0.124 0.295 0.295 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "SS loadings" "Proportion Var" "Cumulative Var" "Proportion Explained" ...
##   .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
##  - attr(*, "class")= chr [1:2] "psych" "fa"
# Use maximum likelihood for extracting factors.
psych::fa(hsq_polychoric, nfactors=4, fm="mle")
## Factor Analysis using method =  ml
## Call: psych::fa(r = hsq_polychoric, nfactors = 4, fm = "mle")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       ML1   ML2   ML4   ML3   h2   u2 com
## Q1   0.67 -0.05 -0.01  0.04 0.48 0.52 1.0
## Q2  -0.09 -0.03  0.61 -0.04 0.42 0.58 1.1
## Q3  -0.12  0.13  0.08 -0.50 0.35 0.65 1.3
## Q4   0.01  0.62  0.01  0.00 0.39 0.61 1.0
## Q5  -0.60  0.07  0.08 -0.01 0.42 0.58 1.1
## Q6  -0.28 -0.03  0.42 -0.03 0.34 0.66 1.8
## Q7  -0.17 -0.03 -0.01  0.61 0.37 0.63 1.2
## Q8  -0.03  0.76  0.01  0.01 0.58 0.42 1.0
## Q9   0.47 -0.12  0.00  0.03 0.26 0.74 1.1
## Q10  0.05  0.05  0.77 -0.03 0.58 0.42 1.0
## Q11  0.13  0.00  0.15 -0.53 0.29 0.71 1.3
## Q12 -0.14  0.66 -0.04  0.05 0.45 0.55 1.1
## Q13 -0.68  0.00  0.11  0.01 0.53 0.47 1.1
## Q14 -0.18 -0.02  0.64  0.02 0.51 0.49 1.2
## Q15  0.04 -0.05  0.07  0.69 0.50 0.50 1.0
## Q16  0.10 -0.53  0.11  0.15 0.33 0.67 1.3
## Q17  0.76 -0.03  0.04  0.05 0.58 0.42 1.0
## Q18  0.11  0.00  0.80  0.00 0.58 0.42 1.0
## Q19 -0.14  0.15  0.21 -0.40 0.34 0.66 2.2
## Q20  0.11  0.79 -0.01 -0.07 0.64 0.36 1.1
## Q21 -0.66  0.13  0.13  0.17 0.55 0.45 1.3
## Q22  0.11  0.06 -0.29  0.13 0.14 0.86 1.8
## Q23  0.11  0.04 -0.01  0.51 0.29 0.71 1.1
## Q24  0.20  0.49  0.08  0.05 0.27 0.73 1.4
## Q25  0.77  0.06 -0.01  0.02 0.59 0.41 1.0
## Q26 -0.08  0.03  0.68  0.04 0.52 0.48 1.0
## Q27  0.09  0.07  0.11 -0.51 0.28 0.72 1.2
## Q28 -0.10  0.27  0.24 -0.13 0.23 0.77 2.8
## Q29  0.59  0.18  0.06  0.16 0.38 0.62 1.4
## Q30 -0.02 -0.13  0.51  0.00 0.25 0.75 1.1
## Q31  0.10  0.07  0.07  0.70 0.51 0.49 1.1
## Q32 -0.08  0.67  0.06  0.02 0.50 0.50 1.0
## 
##                        ML1  ML2  ML4  ML3
## SS loadings           4.02 3.30 3.38 2.75
## Proportion Var        0.13 0.10 0.11 0.09
## Cumulative Var        0.13 0.23 0.33 0.42
## Proportion Explained  0.30 0.25 0.25 0.20
## Cumulative Proportion 0.30 0.54 0.80 1.00
## 
##  With factor correlations of 
##       ML1   ML2   ML4   ML3
## ML1  1.00 -0.14 -0.38  0.18
## ML2 -0.14  1.00  0.25 -0.17
## ML4 -0.38  0.25  1.00 -0.05
## ML3  0.18 -0.17 -0.05  1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  496  and the objective function was  12.74
## The degrees of freedom for the model are 374  and the objective function was  2.18 
## 
## The root mean square of the residuals (RMSR) is  0.04 
## The df corrected root mean square of the residuals is  0.05 
## 
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML4  ML3
## Correlation of (regression) scores with factors   0.94 0.93 0.93 0.91
## Multiple R square of scores with factors          0.89 0.87 0.87 0.82
## Minimum correlation of possible factor scores     0.78 0.74 0.74 0.64
# Use PAF on hsq_polychoric.
hsq_correl_pa <- psych::fa(hsq_polychoric, nfactors=4, fm="pa")

# Sort the communalities of the f_hsq_pa.
f_hsq_pa_common <- sort(hsq_correl_pa$communality, decreasing = TRUE)

# Sort the uniqueness of the f_hsq_pa.
f_hsq_pa_unique <- sort(hsq_correl_pa$uniqueness, decreasing = TRUE)


# Check out the scree test and the Kaiser-Guttman criterion.
psych::scree(hsq_polychoric)

# Use parallel analysis for estimation with the minres extraction method.
psych::fa.parallel(hsq_polychoric, n.obs = 1069, fm = "minres", fa = "fa")

## Parallel analysis suggests that the number of factors =  7  and the number of components =  NA
# Use parallel analysis for estimation with the mle extraction method.
psych::fa.parallel(hsq_polychoric, n.obs = 1069, fm = "mle", fa = "fa")

## Parallel analysis suggests that the number of factors =  7  and the number of components =  NA

Chapter 4 - Advanced EFA

Interpretation of EFA and factor rotation:

  • Factor rotation is a standard step for simplifying interpretation
    • Orthogonal - factors are kept uncorrelated, with axes at 90-degree angles (generally good for unrelated factors, and easier to interpret since they are fully independent)
    • Oblique - factors are allowed to correlate, with axes at angles other than 90 degrees (generally good for related factors)
  • Example of the psych::bfi data analysis
    • f_bfi_varimax <- fa(bfi_c, fm = "minres", nfactors = 5, rotate = "varimax")

Interpretation of EFA and path diagrams:

  • Interpretation is an important component of the modeling process
    • fa.diagram(f_bfi_varimax)
    • print(f_bfi_varimax$loadings, cut=0) # cut=0 shows all loadings; the default cut of 0.1 would hide loadings with absolute value below 0.1

EFA case study:

  • The “short dark triad” combines Machiavellianism, narcissism, and psychopathy
  • A short version of the test is available at https://openpsychometrics.org/tests/SD3/
    • sdt_test <- read.csv("SD3.csv", sep = "")
    • dim(sdt_test)
    • head(sdt_test)
  • General steps for EFA on the data will include
    • Check for data factorability
    • Extract factors
    • Choose the “right” number of factors to retain
    • Rotate factors
    • Interpret the results

Wrap up:

  • Biggest challenge is in handling large amounts of data - computation, interpretation, etc.
  • PCA and NMF (non-negative entries only) are popular choices for many types of datasets
  • EFA for exploratory purposes
  • Common steps in dimensionality reduction
    • Factor/Component/Dimension extraction
    • Deciding how many Factors/Components/Dimensions to retain - parsimony, minimal information loss, easy interpretation
    • Use visual aids for interpretation (e.g. biplot)

Example code includes:

# Check the default rotation method.
f_hsq$rotation
## [1] "oblimin"
# Try Promax, another oblique rotation method.
f_hsq_promax <- psych::fa(hsq_polychoric, nfactors=4, rotate="promax")

# Try Varimax, an orthogonal method.
f_hsq_varimax <- psych::fa(hsq_polychoric, nfactors=4, rotate="varimax") 


# Check the factor loadings.
print(f_hsq$loadings, cut=0)
## 
## Loadings:
##     MR1    MR2    MR4    MR3   
## Q1   0.675 -0.055 -0.005  0.029
## Q2  -0.085 -0.023  0.604 -0.034
## Q3  -0.113  0.130  0.082 -0.512
## Q4   0.018  0.635  0.023  0.002
## Q5  -0.599  0.066  0.095 -0.015
## Q6  -0.257 -0.038  0.462 -0.040
## Q7  -0.166 -0.030  0.002  0.607
## Q8  -0.027  0.741  0.009  0.007
## Q9   0.485 -0.124  0.011  0.016
## Q10  0.034  0.056  0.736 -0.020
## Q11  0.139 -0.018  0.163 -0.541
## Q12 -0.142  0.663 -0.045  0.054
## Q13 -0.641  0.006  0.143  0.005
## Q14 -0.173 -0.018  0.644  0.022
## Q15  0.055 -0.044  0.078  0.688
## Q16  0.123 -0.523  0.118  0.126
## Q17  0.769 -0.035  0.043  0.050
## Q18  0.107  0.005  0.780 -0.004
## Q19 -0.122  0.141  0.229 -0.412
## Q20  0.099  0.779 -0.021 -0.082
## Q21 -0.641  0.131  0.153  0.156
## Q22  0.136  0.059 -0.267  0.105
## Q23  0.124  0.036  0.004  0.491
## Q24  0.218  0.509  0.083  0.042
## Q25  0.761  0.061 -0.020  0.007
## Q26 -0.078  0.033  0.685  0.032
## Q27  0.103  0.058  0.114 -0.530
## Q28 -0.075  0.272  0.248 -0.155
## Q29  0.607  0.182  0.068  0.151
## Q30  0.005 -0.133  0.538 -0.008
## Q31  0.107  0.066  0.080  0.693
## Q32 -0.074  0.694  0.051  0.023
## 
##                  MR1   MR2   MR4   MR3
## SS loadings    3.768 3.241 3.233 2.688
## Proportion Var 0.118 0.101 0.101 0.084
## Cumulative Var 0.118 0.219 0.320 0.404
# Create the path diagram of the latent factors.
psych::fa.diagram(f_hsq)

SD3 <- readRDS("./RInputFiles/SD3.RDS")
# SD3_mod <- SD3 %>% mutate_all(factor, levels=1:5)
sdt_sub_correl <- polycor::hetcor(SD3)


# Explore sdt_sub_correl.
str(sdt_sub_correl)
## List of 7
##  $ correlations: num [1:27, 1:27] 1 0.184 0.102 0.217 0.369 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  $ type        : chr [1:27, 1:27] "" "Pearson" "Pearson" "Pearson" ...
##  $ NA.method   : chr "complete.obs"
##  $ ML          : logi FALSE
##  $ std.errors  : num [1:27, 1:27] 0 0.0969 0.0993 0.0956 0.0868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  $ n           : int 100
##  $ tests       : num [1:27, 1:27] 0.00 5.78e-13 1.55e-16 8.63e-14 4.36e-14 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##   .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
##  - attr(*, "class")= chr "hetcor"
# Get the correlation matrix of the sdt_sub_correl.
sdt_polychoric <- sdt_sub_correl$correlations

# Apply the Bartlett test on the correlation matrix.
psych::cortest.bartlett(sdt_polychoric)
## Warning in psych::cortest.bartlett(sdt_polychoric): n not specified, 100
## used
## $chisq
## [1] 1019.442
## 
## $p.value
## [1] 2.054927e-66
## 
## $df
## [1] 351
# Check the KMO index.
psych::KMO(sdt_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = sdt_polychoric)
## Overall MSA =  0.82
## MSA for each item = 
##   M1   M2   M3   M4   M5   M6   M7   M8   M9   N1   N2   N3   N4   N5   N6 
## 0.78 0.84 0.80 0.66 0.91 0.84 0.68 0.77 0.79 0.80 0.82 0.83 0.87 0.85 0.84 
##   N7   N8   N9   P1   P2   P3   P4   P5   P6   P7   P8   P9 
## 0.80 0.81 0.89 0.89 0.64 0.87 0.52 0.81 0.88 0.52 0.63 0.85
# Check out the scree test.
psych::scree(sdt_polychoric)

# Use parallel analysis for estimation with the minres extraction method.
psych::fa.parallel(sdt_polychoric, n.obs = 100, fa = "fa")

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA
# Perform EFA with MLE. 
f_sdt <- psych::fa(sdt_polychoric, fm = "ml", nfactors = 4)


# Check the factor loadings.
print(f_sdt$loadings, cut=0)
## 
## Loadings:
##    ML1    ML4    ML2    ML3   
## M1  0.005  0.043  0.578 -0.194
## M2  0.236  0.407  0.193  0.152
## M3 -0.019  0.654  0.023  0.091
## M4  0.029  0.329  0.254 -0.134
## M5  0.184  0.179  0.550  0.075
## M6  0.064 -0.099  0.849  0.055
## M7  0.104  0.171  0.438 -0.454
## M8  0.504  0.255 -0.025 -0.183
## M9  0.048  0.325  0.450  0.037
## N1  0.082  0.202  0.033  0.409
## N2  0.037 -0.160 -0.105 -0.501
## N3  0.221  0.056  0.012  0.615
## N4 -0.014  0.438  0.160  0.372
## N5 -0.059  0.580  0.107  0.166
## N6 -0.299 -0.300  0.104 -0.356
## N7 -0.189  0.346  0.222  0.219
## N8 -0.197 -0.058 -0.276 -0.334
## N9  0.754 -0.003  0.014 -0.017
## P1  0.411  0.012  0.296  0.053
## P2  0.001 -0.129 -0.089 -0.213
## P3  0.395 -0.008  0.220  0.020
## P4  0.015  0.104 -0.111  0.318
## P5  0.556  0.026  0.076  0.070
## P6  0.634 -0.047  0.174  0.139
## P7 -0.419  0.131  0.190 -0.016
## P8  0.101  0.594 -0.179 -0.277
## P9  0.261  0.525 -0.049  0.084
## 
##                  ML1   ML4   ML2   ML3
## SS loadings    2.445 2.432 2.304 1.844
## Proportion Var 0.091 0.090 0.085 0.068
## Cumulative Var 0.091 0.181 0.266 0.334
# Create the path diagram of the latent factors.
psych::fa.diagram(f_sdt)


Anomaly Detection in R

Chapter 1 - Statistical Outlier Detection

Meaning of anomalies:

  • Anomalies are data points that do not seem to follow the same patterns as the rest of the data
    • Point anomaly - a single point that is unusual compared with the rest of the data
    • boxplot(temperature, ylab = "Celsius")
    • Collective anomaly - series of points that are unusual compared with the rest of the data
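
As a toy illustration (simulated data, not from the course), base R's boxplot.stats() reports the points that boxplot() would draw beyond the whiskers, which is a quick way to flag a point anomaly:

```r
# Simulated temperature series with one injected point anomaly at index 25.
set.seed(42)
temperature <- c(rnorm(24, mean = 20, sd = 1), 35, rnorm(25, mean = 20, sd = 1))

# boxplot.stats()$out lists values beyond 1.5 * IQR from the quartiles.
outliers <- boxplot.stats(temperature)$out
outliers
which(temperature %in% outliers)
```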

Testing extremes with Grubbs’ test:

  • Grubbs’ test assumes data are normally distributed (should be checked prior to running the analysis)
    • hist(temperature, breaks = 6)
    • grubbs.test(temperature)
  • Can get the row number that was flagged based on whether Grubbs’ test flagged the maximum or the minimum value
    • which.max(weights)
    • which.min(weights)

Anomalies in time series:

  • Can begin by visualizing a time series using plot
    • plot(sales ~ month, data = msales, type = "o")
  • The Seasonal Hybrid ESD algorithm is implemented in the AnomalyDetection library
    • library(AnomalyDetection)
    • sales_ad <- AnomalyDetectionVec(x = msales$sales, period = 12, direction = 'both') # direction can also be 'pos' or 'neg'
    • sales_ad$anoms
    • AnomalyDetectionVec(x = msales$sales, period = 12, direction = 'both', plot = T) # plot the anomalies in blue

Example code includes:

river <- data.frame(index=1:291, 
                    nitrate=c(1.581, 1.323, 1.14, 1.245, 1.072, 1.483, 1.162, 1.304, 1.14, 1.118, 1.342, 1.245, 1.204, 1.14, 1.204, 1.118, 1.025, 1.118, 1.285, 1.14, 0.949, 0.922, 0.949, 1.118, 1.265, 1.095, 1.183, 1.162, 1.118, 1.285, 1.049, 0.922, 0.775, 0.866, 0.922, 1.643, 1.323, 1.285, 1.095, 1.049, 1.095, 0.922, 0.866, 1.049, 0.922, 1.095, 1.183, 1.304, 1.162, 1.225, 1.285, 1.072, 1.533, 1.095, 1.396, 1.025, 0.922, 0.949, 1.118, 1.342, 1.36, 1.36, 1.204, 1.265, 1, 1.183, 1.025, 0.866, 1.072, 1.049, 1.049, 1.049, 1.095, 1.183, 1.095, 0.975, 1.118, 0.975, 1.049, 0.837, 0.922, 1.118, 1.072, 1.204, 0.975, 1.095, 1.049, 0.866, 0.922, 1.049, 1.127, 1.072, 0.975, 1.049, 1.183, 1.245, 1.225, 1.225, 1.265, 1.118, 1.14, 1.072, 1.095, 0.671, 1.183, 0.949, 1.162, 1.095, 1.323, 1.342, 1.277, 1.015, 1, 0.922, 0.894, 1, 1.049, 0.922, 1.517, 1.265, 1.414, 1.304, 1.14, 1.14, 1.049, 1.068, 0.906, 1.095, 0.883, 1.14, 1.025, 1.36, 1.183, 1.265, 1.304, 0.964, 0.975, 0.99, 0.877, 1.049, 0.975, 1, 1.183, 1.225, 1.265, 1.183, 1.049, 0.97, 0.894, 0.98, 0.964, 0.894, 0.922, 1.14, 1.183, 1.897, 1.095, 1.14, 1.414, 1.14, 1, 1.049, 0.889, 0.872, 1, 1.095, 0.671, 1.095, 1.14, 1.304, 1.025, 0.975, 1, 0.877, 0.949, 0.866, 1.058, 1.086, 1.118, 1.162, 1.221, 1.265, 1.122, 1.015, 1.162, 0.825, 0.906, 0.849, 0.985, 1.118, 1.077, 1.237, 1.237, 1.063, 1.01, 0.933, 0.922, 0.806, 0.748, 0.592, 0.911, 0.806, 0.98, 1.077, 1.212, 1.277, 0.954, 0.837, 0.917, 0.9, 1.068, 0.872, 0.99, 1.131, 1.068, 1.208, 1.319, 1.281, 0.905, 0.819, 0.826, 0.974, 0.888, 0.804, 0.996, 1.127, 1.17, 1.166, 1.261, 1.275, 1.179, 1.079, 0.951, 0.852, 0.872, 0.834, 0.859, 1.077, 1.095, 1.285, 1.323, 1.16, 1.125, 0.957, 0.948, 0.907, 0.89, 0.999, 0.999, 0.953, 0.9, 0.986, 1.187, 1.054, 1.079, 0.997, 0.851, 0.803, 0.971, 1.025, 1.086, 1.114, 1.068, 1.091, 1.034, 0.871, 0.781, 0.865, 0.7, 0.673, 0.881, 0.782, 0.97, 1.044, 1.17, 1.196, 1.091, 1.068, 0.967, 0.823, 0.73, 0.693, 0.788, 1.095, 1.183, 0.996, 1.105, 0.939, 0.914, 0.813, 
0.775), 
                    month=factor(month.name[c(rep(1:12, times=24), 1:3)], levels=month.name)
                    )
str(river)
## 'data.frame':    291 obs. of  3 variables:
##  $ index  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ nitrate: num  1.58 1.32 1.14 1.25 1.07 ...
##  $ month  : Factor w/ 12 levels "January","February",..: 1 2 3 4 5 6 7 8 9 10 ...
# Explore contents of dataset
head(river)
##   index nitrate    month
## 1     1   1.581  January
## 2     2   1.323 February
## 3     3   1.140    March
## 4     4   1.245    April
## 5     5   1.072      May
## 6     6   1.483     June
# Summary statistics of river nitrate concentrations
summary(river$nitrate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5920  0.9485  1.0680  1.0649  1.1700  1.8970
# Plot the distribution of nitrate concentration
boxplot(river$nitrate)

# Plot a histogram of the nitrate column
hist(river$nitrate)

# Add a Nitrate concentration label 
hist(river$nitrate, xlab="Nitrate concentration")

# Separate the histogram into 40 bins 
hist(river$nitrate, xlab = "Nitrate concentration", breaks = 40)

# Apply Grubbs' test to the nitrate data
outliers::grubbs.test(river$nitrate)
## 
##  Grubbs test for one outlier
## 
## data:  river$nitrate
## G = 4.72680, U = 0.92269, p-value = 0.000211
## alternative hypothesis: highest value 1.897 is an outlier
# Use which.max to find row index of the max
which.max(river$nitrate)
## [1] 156
# Run Grubbs' test again, excluding row 156
outliers::grubbs.test(river$nitrate[-156])
## 
##  Grubbs test for one outlier
## 
## data:  river$nitrate[-156]
## G = 3.42980, U = 0.95915, p-value = 0.07756
## alternative hypothesis: highest value 1.643 is an outlier
# Print the minimum nitrate value after excluding row 156
min(river$nitrate[-156])
## [1] 0.592
# View contents of dataset
head(river)
##   index nitrate    month
## 1     1   1.581  January
## 2     2   1.323 February
## 3     3   1.140    March
## 4     4   1.245    April
## 5     5   1.072      May
## 6     6   1.483     June
# Show the time series of nitrate concentrations with time
plot(nitrate ~ month, data = river, type = "o")

# Calculate the mean nitrate by month
monthly_mean <- tapply(river$nitrate, river$month, FUN = mean)
monthly_mean
##   January  February     March     April       May      June      July 
## 1.2163600 1.1838400 1.1050400 1.0166250 0.9978333 0.9792083 0.9810417 
##    August September   October  November  December 
## 0.9380833 0.9885833 1.0360000 1.0962500 1.2264167
# Plot the monthly means 
plot(monthly_mean, type = "o", xlab = "Month", ylab = "Monthly mean")

# Create a boxplot of nitrate against months
boxplot(nitrate ~ month, data=river)

# The 'AnomalyDetection' package used below is not on CRAN; it can be
# installed from GitHub via remotes::install_github("twitter/AnomalyDetection").
# (The unrelated CRAN package 'anomalyDetection' was archived on 2019-03-01.)

# Run Seasonal-Hybrid ESD for nitrate concentrations
# AnomalyDetectionVec(river$nitrate, period=12, direction = 'both', plot = T)

# Use Seasonal-Hybrid ESD for nitrate concentrations
# river_anomalies <- AnomalyDetectionVec(x = river$nitrate, period = 12, direction = 'both', plot = T)

# Print the anomalies
# river_anomalies$anoms

# Print the plot
# print(river_anomalies$plot)

Chapter 2 - Distance and Density Based Anomaly Detection

k-Nearest-Neighbors Score:

  • Example dataset for heights and widths of furniture
    • plot(Width ~ Height, data = furniture)
  • Anomalies are usually far away from their neighbors
    • library(FNN)
    • furniture_knn <- get.knn(data = furniture, k = 5)
    • get.knn() returns two matrices - "nn.index" and "nn.dist"
    • head(furniture_knn$nn.dist, 3) # distance matrix
    • furniture_score <- rowMeans(furniture_knn$nn.dist) # average distance to 5 nearest neighbors
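
The same average-distance score can be reproduced without FNN on simulated data (the point layout here is illustrative, not from the course):

```r
# kNN distance score by hand: mean distance to each point's k nearest neighbors.
set.seed(1)
pts <- rbind(matrix(rnorm(40), ncol = 2),  # 20 points near the origin
             c(6, 6))                      # 1 far-away point (row 21)

d <- as.matrix(dist(pts))  # pairwise Euclidean distances
diag(d) <- Inf             # a point is not its own neighbor

k <- 5
score <- apply(d, 1, function(row) mean(sort(row)[1:k]))

# The injected outlier gets the largest score.
which.max(score)
```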

Visualizing kNN distance:

  • Can be important to standardize distances prior to kNN (unless it is desirable that a single variable dominate the scoring)
    • furniture_scaled <- scale(furniture)
    • furniture_knn <- get.knn(furniture_scaled, 5)
    • furniture$score <- rowMeans(furniture_knn$nn.dist)
    • plot(Width ~ Height, cex = sqrt(score), data = furniture, pch = 20)

Local outlier factor (LOF):

  • LOF uses density rather than raw distance to score points - the score is a ratio that tends to be near 1 for typical points
    • kNN is helpful for finding global anomalies, while LOF is helpful for finding local anomalies
    • library(dbscan)
    • furniture_lof <- lof(scale(furniture), k = 5)
  • LOF is the density around the point relative to the density around the kNN
    • LOF > 1 is more likely to be anomalous while LOF < 1 is much less likely to be anomalous
    • furniture$score_lof <- furniture_lof
    • plot(Width ~ Height, data = furniture, cex = score_lof, pch = 20)

Example code includes:

wineOrig <- readr::read_csv("./RInputFiles/big_wine.csv")
## Parsed with column specification:
## cols(
##   fixed.acidity = col_double(),
##   volatile.acidity = col_double(),
##   citric.acid = col_double(),
##   residual.sugar = col_double(),
##   chlorides = col_double(),
##   free.sulfur.dioxide = col_double(),
##   total.sulfur.dioxide = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double(),
##   good_wine = col_double()
## )
str(wineOrig, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 14694 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ good_wine           : num  1 1 1 1 1 1 1 1 1 1 ...
wine <- wineOrig %>% select(pH, alcohol)

# View the contents of the wine data
head(wine)
## # A tibble: 6 x 2
##      pH alcohol
##   <dbl>   <dbl>
## 1  3        8.8
## 2  3.3      9.5
## 3  3.26    10.1
## 4  3.19     9.9
## 5  3.19     9.9
## 6  3.26    10.1
# Scatterplot of wine pH against alcohol
plot(pH ~ alcohol, data = wine)

# Calculate the 5 nearest neighbors distance
wine_nn <- FNN::get.knn(wine, k = 5)

# View the distance matrix
head(wine_nn$nn.dist)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0
## [6,]    0    0    0    0    0
# Distance from wine 5 to nearest neighbor
wine_nn$nn.dist[5, 1]
## [1] 0
# Row index of wine 5's nearest neighbor 
wine_nn$nn.ind[5, 1]
## [1] 9800
# Return data for wine 5 and its nearest neighbor
wine[c(5, wine_nn$nn.ind[5, 1]), ]
## # A tibble: 2 x 2
##      pH alcohol
##   <dbl>   <dbl>
## 1  3.19     9.9
## 2  3.19     9.9
# Create score by averaging distances
wine_nnd <- rowMeans(wine_nn$nn.dist)

# Print row index of the most anomalous point
which.max(wine_nnd)
## [1] 3919
# Observe differences in column scales 
summary(wine)
##        pH           alcohol     
##  Min.   :2.720   Min.   : 8.00  
##  1st Qu.:3.090   1st Qu.: 9.50  
##  Median :3.180   Median :10.40  
##  Mean   :3.188   Mean   :10.51  
##  3rd Qu.:3.280   3rd Qu.:11.40  
##  Max.   :3.820   Max.   :14.20
# Standardize the wine columns
wine_scaled <- scale(wine)

# Observe standardized column scales
summary(wine_scaled)
##        pH              alcohol        
##  Min.   :-3.10130   Min.   :-2.04323  
##  1st Qu.:-0.65081   1st Qu.:-0.82425  
##  Median :-0.05475   Median :-0.09286  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.60755   3rd Qu.: 0.71979  
##  Max.   : 4.18393   Max.   : 2.99522
# Print the 5-nearest neighbor distance score
wine_nnd[1:5]
## [1] 0 0 0 0 0
# Add the score as a new column 
wine$score <- wine_nnd


# Scatterplot showing pH, alcohol and kNN score
plot(pH ~ alcohol, data=wine, cex = sqrt(score), pch = 20)

# Calculate the LOF for wine data
wine$score <- NULL
wine_lof <- dbscan::lof(scale(wine), k=5)

# Append the LOF score as a new column
wine$score <- wine_lof


# Scatterplot showing pH, alcohol and LOF score
plot(pH ~ alcohol, data=wine, cex=score, pch=20)

# Calculate and append kNN distance as a new column
wine_nn <- FNN::get.knn(wine_scaled, k = 10)
wine$score_knn <- rowMeans(wine_nn$nn.dist)     

# Calculate and append LOF as a new column
wine$score_lof <- dbscan::lof(wine_scaled, k = 10)

# Find the row location of highest kNN
which.max(wine$score_knn)
## [1] 2957
# Find the row location of highest LOF
which.max(wine$score_lof)
## [1] 15

Chapter 3 - Isolation Forest

Isolation Trees:

  • Example of deer organizing into a pack and predators trying to find isolated deer to attack
    • Points that are more easily separated from other points are considered to be more anomalous
    • Keep splitting data until every point lies within its own sub-region OR there are fewer than n points in each sub-region
  • Example of running the isolation forest
    • library(isofor)
    • furniture_tree <- iForest(data = furniture, nt = 1) # nt is the number of trees to be grown
    • furniture_score <- predict(furniture_tree, newdata = furniture) # generates the isolation score
  • The isolation score is the average number of random splits needed to isolate a point
    • The score returned by predict is normalized, with scores near 1 more likely to be anomalies (short path length, easy to split)
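
The normalization above can be sketched in a few lines. This is an illustrative implementation of the standard isolation-score formula, not code from the course or the isofor package; `avg_path_norm` and `iso_score` are hypothetical helper names.

```r
# Illustrative sketch of the isolation-score normalization (assumed helper
# names; not part of the isofor package). avg_path_norm(n) approximates the
# expected path length of an unsuccessful BST search on n points; the score
# is 2^(-avg_depth / avg_path_norm(n)).
avg_path_norm <- function(n) {
  # Harmonic number H(i) approximated by ln(i) + Euler-Mascheroni constant
  2 * (log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
}

iso_score <- function(avg_depth, n) 2^(-avg_depth / avg_path_norm(n))

iso_score(avg_depth = 2, n = 256)   # shallow isolation -> high score (anomalous)
iso_score(avg_depth = 12, n = 256)  # deep isolation -> lower score
```

A point isolated after only a couple of random splits scores close to 1, while a point buried deep in the tree scores near or below 0.5.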

Isolation Forest:

  • Can use sampling to build multiple trees (called an isolation forest, with score averaged over trees)
    • furniture_tree <- iForest(data = furniture, nt = 1, phi = 100)
    • furniture_forest <- iForest(data = furniture, nt = 100)
  • The forest of many trees is more robust and faster to grow
    • The anomaly score should generally converge after "sufficient" trees have been run (100 is a typical default)
    • Scores should largely be stable if more or fewer trees are added
    • plot(trees_500 ~ trees_1000, data = furniture_scores)
    • abline(a = 0, b = 1)
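
Since the isofor package is unavailable here, the convergence check can be illustrated with simulated scores; `trees_500` and `trees_1000` below are made-up stand-ins for scores from two forest sizes.

```r
# Simulated stand-in scores for two forest sizes (assumption: isofor itself
# is unavailable, so these are fabricated for illustration only)
set.seed(1)
trees_500  <- runif(200, min = 0.3, max = 0.8)
trees_1000 <- trees_500 + rnorm(200, sd = 0.005)  # nearly identical, as expected

# Converged scores hug the 45-degree line of equality
plot(trees_500 ~ trees_1000)
abline(a = 0, b = 1)
```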

Visualizing Isolation Scores:

  • Can visualize the isolation score using the contour plot
    • h_seq <- seq(min(furniture$Height), max(furniture$Height), length.out = 20)
    • w_seq <- seq(min(furniture$Width), max(furniture$Width), length.out = 20)
    • furniture_grid <- expand.grid(Width = w_seq, Height = h_seq)
    • furniture_grid$score <- predict(furniture_forest, furniture_grid)
    • library(lattice)
    • contourplot(score ~ Height + Width, data = furniture_grid, region = TRUE)

Example code includes:

wine <- wine %>% select(pH, alcohol)
str(wine, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 14694 obs. of  2 variables:
##  $ pH     : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ alcohol: num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
# CRAN - package 'isofor' is not available (for R version 3.5.1)

# Build an isolation tree 
# wine_tree <- iForest(wine, nt = 1)

# Create isolation score
# wine$tree_score <- predict(wine_tree, newdata = wine)

# Histogram plot of the scores
# hist(wine$tree_score, breaks=40)


# Fit isolation forest
# wine_forest <- iForest(wine, nt=100)

# Fit isolation forest
# wine_forest <- iForest(wine, nt = 100, phi = 200)

# Create isolation score from forest
# wine_score <- predict(wine_forest, newdata=wine)

# Append score to the wine data
# wine$score <- wine_score


# View the contents of the wine scores
# head(wine_scores)

# Score scatterplot 2000 vs 1000 trees 
# plot(trees_2000 ~ trees_1000, data = wine_scores)

# Add reference line of equality
# abline(a = 0, b = 1)


# Sequence of values for pH and alcohol
ph_seq <- seq(min(wine$pH), max(wine$pH), length.out = 25)
alcohol_seq <- seq(min(wine$alcohol), max(wine$alcohol) , length.out = 25)

# Create a data frame of grid coordinates
wine_grid <- expand.grid(pH = ph_seq, alcohol = alcohol_seq)

# Plot the grid
plot(pH ~ alcohol, data=wine_grid, pch = 20)

# Calculate isolation score at grid locations
# wine_grid$score <- predict(wine_forest, newdata=wine_grid)


# Contour plot of isolation scores
# contourplot(score ~ alcohol + pH, data=wine_grid, region = TRUE)

Chapter 4 - Comparing Performance

Labeled Anomalies:

  • Sometimes, the anomalies are already labelled, allowing for supervised learning
    • table(sat$label)
    • plot(V2 ~ V3, data = sat, col = as.factor(label), pch = 20)
    • sat_for <- iForest(sat[, -1], nt = 100)
    • sat$score <- predict(sat_for, sat[, -1])
    • boxplot(score ~ label, data = sat, col = "olivedrab4")
  • Challenges with modeling can include
    • Detecting rare cases
    • Rapidly changing exploits (e.g., card fraud)

Measuring Performance:

  • Decision thresholds are sometimes based on quantiles; flag a fixed percentage of the data as anomalous
    • high_score <- quantile(sat$score, probs = 0.99)
    • sat$binary_score <- as.numeric(sat$score >= high_score)
    • table(sat$label, sat$binary_score)
  • Can calculate recall as correctly detected anomalies divided by total actual anomalies (1 means all anomalies detected)
  • Can calculate precision as correctly detected anomalies divided by total predicted anomalies (1 means no false detections)
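
A minimal worked example of these two ratios, using made-up labels and binary scores:

```r
# Hypothetical labels (1 = true anomaly) and thresholded binary scores
label  <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
binary <- c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1)

tp <- sum(label == 1 & binary == 1)   # correctly detected anomalies

precision <- tp / sum(binary == 1)    # 3 / 4 = 0.75: one false detection
recall    <- tp / sum(label == 1)     # 3 / 4 = 0.75: one anomaly missed
```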

Working with Categorical Features:

  • Begin by checking for any non-numeric features - fct or chr classes
    • sapply(X = sat, FUN = class)
  • The isolation forest can take factor variables (but not character variables directly) provided they have at most 32 unique levels
    • sat$high_low <- as.factor(sat$high_low)
    • class(sat$high_low)
    • sat_for <- iForest(sat[, -1], nt = 100)
  • Can run LOF with factors using the Gower distance (all distances are standardized to be between 0 and 1) from the daisy package
    • library(cluster)
    • sat_dist <- daisy(sat[, -1], metric = "gower")
    • sat_lof <- lof(sat_dist, k = 10)
    • sat_distmat <- as.matrix(sat_dist)
    • range(sat_distmat)
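
A self-contained sketch of the Gower computation on a tiny mixed-type data frame (the data are made up; cluster ships with standard R installations):

```r
library(cluster)

# Tiny hypothetical mixed-type example: one numeric and one factor column
df <- data.frame(x = c(1, 2, 10), grp = factor(c("a", "a", "b")))

# Gower distance standardizes each column's contribution to [0, 1]
d <- daisy(df, metric = "gower")
range(as.matrix(d))  # all Gower distances fall within [0, 1]
```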

Wrap Up:

Example code includes:

thyroidOrig <- readr::read_csv("./RInputFiles/thyroid.csv")
## Parsed with column specification:
## cols(
##   label = col_double(),
##   TSH = col_double(),
##   T3 = col_double(),
##   TT4 = col_double(),
##   T4U = col_double(),
##   FTI = col_double(),
##   TBG = col_double()
## )
str(thyroidOrig, give.attr=FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of  7 variables:
##  $ label: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TSH  : num  -0.256 -1.397 -0.704 -0.389 -1.442 ...
##  $ T3   : num  -6.78 -7.66 -5.63 -6.38 -7.66 ...
##  $ TT4  : num  -1.98 -1.27 -1.5 -1.85 -1.42 ...
##  $ T4U  : num  -1.29 -1.11 -1.45 -1.74 -1.14 ...
##  $ FTI  : num  -1.218 -0.625 -0.643 -1.099 -1.099 ...
##  $ TBG  : num  -1.44 -1.75 -2.08 -1.99 -1.4 ...
thyroid <- thyroidOrig


# View contents of thyroid data
head(thyroid)
## # A tibble: 6 x 7
##   label    TSH    T3   TT4   T4U    FTI   TBG
##   <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     0 -0.256 -6.78 -1.98 -1.29 -1.22  -1.44
## 2     0 -1.40  -7.66 -1.27 -1.11 -0.625 -1.75
## 3     0 -0.704 -5.63 -1.50 -1.45 -0.643 -2.08
## 4     0 -0.389 -6.38 -1.85 -1.74 -1.10  -1.99
## 5     0 -1.44  -7.66 -1.42 -1.14 -1.10  -1.40
## 6     0 -0.313 -7.66 -1.92 -1.63 -1.43  -1.62
# Tabulate the labels
table(thyroid$label)
## 
##   0   1 
## 978  22
# Proportion of thyroid cases
prop_disease <- mean(thyroid$label)


# Plot of TSH and T3
plot(TSH ~ T3, data=thyroid, pch=20)

# Plot of TSH, T3 and labels
plot(TSH ~ T3, data = thyroid, pch = 20, col = label + 1)

# Plot of TT4, TBG and labels
plot(TT4 ~ TBG, data = thyroid, pch = 20, col = label + 1)

# Package isofor not available on CRAN
# Fit isolation forest
# thyroid_forest <- isofor::iForest(thyroid[, -1], nt = 200)

# Anomaly score 
# thyroid$iso_score <- predict(thyroid_forest, thyroid[, -1])

# Boxplot of the anomaly score against labels
# boxplot(iso_score ~ label, data=thyroid, col = "olivedrab4")


# Create a LOF score for thyroid
lof_score <- dbscan::lof(scale(thyroid[, -1]), k = 10)
                 
# Calculate high threshold for lof_score
high_lof <- quantile(lof_score, probs = 0.98) 

# Append binary LOF score to thyroid data
thyroid$binary_lof <- as.numeric(lof_score >= high_lof)


iso_score <- c(394, 442, 408, 369, 431, 420, 398, 374, 384, 452, 478, 461, 356, 357, 405, 437, 357, 366, 488, 671, 395, 367, 346, 387, 354, 386, 411, 548, 423, 344, 355, 459, 413, 389, 373, 360, 520, 382, 690, 676, 388, 486, 530, 561, 423, 409, 352, 441, 395, 416, 367, 377, 426, 418, 378, 357, 422, 431, 526, 380, 450, 434, 462, 360, 529, 382, 390, 371, 385, 382, 367, 416, 384, 400, 377, 391, 380, 403, 361, 355, 418, 498, 649, 465, 413, 377, 383, 375, 422, 360, 353, 380, 569, 430, 377, 418, 374, 413, 369, 378, 456, 357, 559, 375, 370, 543, 410, 548, 380, 382, 362, 390, 460, 438, 392, 742, 665, 400, 393, 382, 382, 511, 375, 363, 422, 399, 358, 448, 399, 450, 392, 369, 435, 437, 375, 529, 370, 369, 429, 532, 485, 439, 429, 363, 366, 480, 408, 622, 358, 489, 520, 393, 388, 431, 378, 400, 400, 354, 405, 388, 416, 442, 382, 348, 347, 375, 366, 397, 467, 518, 387, 376, 353, 369, 442, 380, 391, 390, 358, 401, 409, 414, 452, 377, 362, 360, 380, 381, 412, 412, 418, 381, 432, 391, 448, 395, 418, 509, 525, 398, 432, 359, 499, 444, 383, 405, 467, 418, 721, 399, 421, 527, 481, 371, 364, 459, 398, 373, 388, 434, 428, 439, 381, 405, 352, 363, 352, 403, 362, 396, 367, 365, 432, 392, 396, 367, 404, 384, 381, 364, 366, 376, 369, 379, 379, 426, 401, 380, 404, 394, 368, 361, 393, 455, 396, 540, 368, 360, 466, 365, 377, 411, 442, 408, 373, 394, 344, 352, 345, 344, 346, 344, 378, 366, 401, 436, 366, 367, 382, 356, 362, 402, 405, 376, 368, 381, 371, 391, 359, 707, 367, 370, 387, 385, 373, 354, 354, 362, 358, 364, 353, 365, 374, 385, 395, 362, 461, 374, 362, 456, 405, 426, 385, 387, 387, 375, 503, 378, 370, 358, 377, 461, 357, 353, 346, 350, 393, 456, 425, 418, 371, 380, 477, 383, 382, 349, 360, 412, 395, 409, 441, 371, 420, 455, 358, 654, 365, 507, 508, 443, 364, 381, 468, 368, 362, 454, 381, 357, 432, 374, 379, 383, 389, 367, 393, 424, 378, 361, 512, 449, 522, 352, 354, 367, 359, 396, 486, 367, 409, 427, 351, 381, 357, 362, 369, 364, 356, 730, 353, 399, 383, 523, 429, 425, 420, 455, 414, 
475, 433, 528, 425, 476, 352, 350, 413, 443, 435, 381, 472, 486, 376, 402, 361, 377, 391, 380, 355, 400, 394, 353, 379, 376, 469, 398, 464, 388, 378, 397, 396, 521, 417, 365, 420, 377, 350, 407, 364, 368, 426, 344, 351, 411, 412, 502, 381, 495, 350, 350, 344, 362, 389, 388, 370, 354, 394, 363, 564, 549, 387, 378, 411, 421, 427, 382, 385, 496, 372, 416, 365, 375, 406, 355, 362, 442, 410, 477, 361, 379, 386, 375, 351, 351, 360, 360, 412, 400, 409, 458, 351, 376, 400, 360, 499, 362, 476, 396, 407, 437, 358, 385, 373, 432, 353, 352, 369, 405, 376, 383, 462, 375, 361, 395, 426, 431, 418, 500, 585, 616, 372, 529, 418, 456, 360, 429, 397, 366, 384, 359, 515, 401, 389, 429, 371, 357, 398, 380, 371, 354, 403, 355, 356, 368, 363, 481, 545, 367, 350, 345, 344, 344, 426, 438, 464, 365, 460, 462, 419, 358, 428, 433, 352, 384, 416, 387) 
iso_score <- c(iso_score, 384, 366, 367, 487, 628, 638, 472, 349, 351, 351, 420, 422, 347, 347, 422, 396, 443, 419, 509, 507, 398, 375, 427, 492, 388, 387, 354, 390, 439, 358, 392, 379, 361, 392, 375, 407, 663, 442, 390, 437, 432, 420, 397, 413, 477, 494, 522, 354, 354, 357, 381, 384, 412, 406, 411, 448, 508, 411, 352, 345, 511, 386, 364, 396, 476, 389, 355, 464, 363, 380, 366, 423, 396, 407, 415, 504, 440, 406, 449, 394, 432, 397, 428, 434, 401, 363, 395, 404, 392, 454, 357, 380, 352, 382, 389, 389, 345, 348, 461, 390, 371, 345, 442, 402, 386, 375, 382, 382, 404, 373, 586, 426, 600, 368, 382, 358, 407, 379, 402, 367, 366, 385, 706, 352, 384, 363, 486, 366, 433, 373, 397, 434, 402, 378, 376, 376, 359, 380, 363, 351, 543, 435, 385, 503, 359, 353, 365, 405, 457, 345, 463, 445, 363, 353, 369, 370, 355, 681, 439, 360, 417, 383, 376, 416, 428, 386, 426, 420, 462, 370, 367, 398, 373, 354, 418, 364, 357, 420, 628, 442, 403, 478, 370, 367, 399, 413, 453, 423, 376, 385, 415, 447, 349, 551, 390, 438, 384, 401, 458, 526, 449, 480, 405, 388, 391, 361, 362, 387, 429, 391, 413, 391, 380, 511, 411, 376, 374, 436, 362, 434, 437, 517, 397, 406, 372, 345, 345, 345, 347, 423, 639, 373, 397, 358, 369, 399, 464, 453, 406, 358, 350, 395, 386, 454, 396, 373, 394, 444, 377, 376, 459, 393, 353, 349, 685, 382, 419, 394, 446, 346, 442, 426, 390, 422, 568, 365, 443, 353, 513, 364, 349, 373, 422, 389, 509, 411, 443, 375, 438, 556, 349, 445, 446, 413, 455, 419, 385, 358, 381, 375, 372, 497, 589, 386, 493, 539, 769, 351, 511, 456, 373, 378, 411, 523, 448, 400, 368, 428, 381, 444, 378, 402, 377, 411, 367, 446, 374, 435, 429, 409, 399, 349, 360, 468, 400, 366, 372, 446, 384, 524, 348, 384, 371, 381, 347, 357, 369, 359, 405, 390, 363, 419, 469, 410, 413, 365, 377, 482, 398, 347, 467, 446, 442, 399, 367, 502, 424, 452, 364, 372, 355, 386, 399, 399, 370, 409, 412, 409, 396, 380, 446, 470, 375, 386, 454, 350, 514, 396, 411, 402, 360, 458, 439, 349, 345, 398, 368, 378, 355, 528, 384, 397, 543, 410, 
370, 389, 506, 412, 454, 442, 602, 383, 367, 377, 489, 371, 471, 361, 366, 355, 508, 368, 390, 368, 375, 406, 512, 374, 380, 378, 344, 381, 400, 544, 375, 527, 390, 398, 455, 393, 427, 435, 512, 379, 367, 380)
iso_score <- iso_score / 1000


# Calculate high threshold for iso_score
high_iso <- quantile(iso_score, probs=0.98)  

# Append binary isolation score to thyroid data
thyroid$binary_iso <- as.numeric(iso_score >= high_iso)         


# Tabulate agreement of label and binary isolation score 
table(thyroid$label, thyroid$binary_iso)
##    
##       0   1
##   0 970   8
##   1  10  12
# Tabulate agreement of label and binary LOF score 
table(thyroid$label, thyroid$binary_lof)
##    
##       0   1
##   0 958  20
##   1  22   0
# Proportion of binary_iso and label that agree
iso_prop <- mean(thyroid$label == thyroid$binary_iso)

# Proportion of binary_lof and label that agree
lof_prop <- mean(thyroid$label == thyroid$binary_lof)


table(thyroid$label, thyroid$binary_iso)
##    
##       0   1
##   0 970   8
##   1  10  12
table(thyroid$label, thyroid$binary_lof)
##    
##       0   1
##   0 958  20
##   1  22   0
# Precision for binary scores
precision_iso <- sum(thyroid$label == 1 & thyroid$binary_iso == 1) / sum(thyroid$binary_iso == 1)
precision_lof <- sum(thyroid$label == 1 & thyroid$binary_lof == 1) / sum(thyroid$binary_lof == 1)

# Recall for binary scores
recall_iso <- sum(thyroid$label == 1 & thyroid$binary_iso == 1) / sum(thyroid$label == 1)
recall_lof <- sum(thyroid$label == 1 & thyroid$binary_lof == 1) / sum(thyroid$label == 1)


age <- c('35-60', '0-35', '35-60', '60+', '0-35', '0-35', '0-35', '0-35', '35-60', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '60+', '35-60', '60+', '35-60', '35-60', '60+', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '0-35', '60+', '35-60', '60+', '60+', '60+', '60+', '35-60', '0-35', '0-35', '0-35', '60+', '0-35', '0-35', '0-35', '35-60', '0-35', '60+', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '35-60', '35-60', '35-60', '35-60', '0-35', '0-35', '35-60', '35-60', '60+', '60+', '35-60', '35-60', '0-35', '0-35', '0-35', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '60+', '0-35', '0-35', '60+', '35-60', '0-35', '35-60', '0-35', '0-35', '60+', '0-35', '0-35', '0-35', '35-60', '60+', '35-60', '35-60', '35-60', '60+', '0-35', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '60+', '35-60', '60+', '35-60', '60+', '60+', '0-35', '35-60', '60+', '60+', '0-35', '60+', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '0-35', '0-35', '35-60', '35-60', '60+', '60+', '60+', '0-35', '60+', '0-35', '0-35', '0-35', '60+', '60+', '0-35', '35-60', '35-60', '0-35', '0-35', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '60+', '60+', '0-35', '0-35', '0-35', '35-60', '0-35', '0-35', '0-35', '60+', '0-35', '60+', '35-60', '35-60', '35-60', '60+', '0-35', '60+', '60+', '60+', '35-60', '35-60', '60+', '60+', '60+', '60+', '35-60', '0-35', '0-35', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '35-60', '35-60', '60+', '60+', '60+', '0-35', '0-35', '0-35', '0-35', '35-60', '60+', '35-60', '35-60', '0-35', '60+', '60+', '0-35', '35-60', '35-60', '60+', '0-35', '60+', '60+', '60+', '0-35', '60+', '60+', '60+', '60+', '35-60', '0-35', '60+', '60+', '35-60', '60+', '0-35', '0-35', '60+', '60+', '60+', '0-35', '0-35', '35-60', '60+', '60+', '35-60', '35-60', '35-60', '60+', '0-35', '60+', '0-35', '60+', '35-60', '60+', '60+', '0-35', '35-60', '35-60', '0-35', '35-60', '60+', '0-35', '60+') 
age <- c(age, '60+', '60+', '0-35', '60+', '35-60', '0-35', '0-35', '60+', '35-60', '35-60', '0-35', '60+', '60+', '0-35', '60+', '35-60', '0-35', '35-60', '0-35', '35-60', '0-35', '60+', '0-35', '60+', '60+', '0-35', '60+', '60+', '60+', '60+', '60+', '0-35', '0-35', '60+', '0-35', '60+', '0-35', '60+', '60+', '35-60', '35-60', '60+', '60+', '60+', '60+', '60+', '35-60', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '0-35', '35-60', '0-35', '60+', '0-35', '35-60', '60+', '35-60', '60+', '0-35', '0-35', '0-35', '0-35', '35-60', '60+', '35-60', '0-35', '60+', '60+', '0-35', '0-35', '35-60', '0-35', '60+', '0-35', '35-60', '60+', '0-35', '0-35', '60+', '35-60', '60+', '0-35', '60+', '0-35', '60+', '35-60', '0-35', '35-60', '0-35', '60+', '60+', '0-35', '60+', '60+', '60+', '35-60', '35-60', '35-60', '35-60', '60+', '60+', '60+', '60+', '35-60', '0-35', '60+', '0-35', '0-35', '35-60', '35-60', '35-60', '60+', '60+', '60+', '35-60', '35-60', '0-35', '35-60', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '0-35', '35-60', '0-35', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '60+', '0-35', '60+', '0-35', '35-60', '0-35', '60+', '60+', '60+', '35-60', '60+', '60+', '35-60', '60+', '0-35', '0-35', '0-35', '60+', '60+', '60+', '35-60', '0-35', '0-35', '35-60', '0-35', '0-35', '35-60', '35-60', '35-60', '60+', '60+', '0-35', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '0-35', '60+', '0-35', '35-60', '0-35', '0-35', '0-35', '0-35', '60+', '35-60', '60+', '35-60', '0-35', '60+', '35-60', '35-60', '60+', '35-60', '0-35', '60+', '60+', '35-60', '0-35', '60+', '35-60', '60+', '0-35', '60+', '0-35', '35-60', '0-35', '0-35', '35-60', '35-60', '0-35', '35-60', '60+', '35-60', '35-60', '60+', '60+', '0-35', '35-60', '0-35', '60+', '0-35', '35-60', '0-35', '0-35', '60+', '0-35', '60+', '0-35', '60+', '60+', '35-60', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '0-35', '35-60', '35-60', '35-60', '0-35', 
'60+', '35-60', '60+', '60+', '60+', '0-35', '0-35', '35-60', '0-35', '0-35', '0-35', '60+', '60+', '60+', '0-35', '0-35', '0-35', '60+', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '60+', '0-35', '35-60', '35-60', '35-60', '60+', '60+', '60+', '60+', '35-60', '60+', '0-35', '60+', '35-60', '0-35', '35-60', '60+', '35-60', '0-35', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '60+', '35-60', '60+', '35-60', '0-35', '0-35', '0-35', '0-35', '60+', '60+', '60+', '60+', '35-60', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '35-60', '60+', '0-35', '35-60', '35-60', '0-35', '35-60', '60+', '35-60', '35-60', '60+', '60+', '35-60', '0-35', '60+', '35-60', '0-35', '35-60', '35-60', '60+', '35-60', '60+', '60+', '60+', '60+', '0-35', '0-35', '60+', '35-60', '0-35', '60+', '35-60', '60+', '60+', '60+', '35-60', '0-35', '0-35', '60+', '0-35', '35-60', '35-60', '35-60', '60+', '0-35', '35-60', '60+', '35-60', '0-35', '60+', '0-35', '35-60')
age <- c(age, '60+', '35-60', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '35-60', '60+', '60+', '60+', '60+', '60+', '0-35', '60+', '60+', '35-60', '60+', '60+', '60+', '0-35', '60+', '35-60', '60+', '35-60', '35-60', '0-35', '35-60', '0-35', '35-60', '0-35', '35-60', '60+', '0-35', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '60+', '60+', '0-35', '0-35', '0-35', '35-60', '35-60', '60+', '35-60', '35-60', '35-60', '0-35', '35-60', '0-35', '35-60', '35-60', '60+', '35-60', '60+', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '0-35', '60+', '60+', '35-60', '0-35', '60+', '35-60', '60+', '60+', '0-35', '0-35', '35-60', '0-35', '35-60', '60+', '0-35', '0-35', '0-35', '60+', '35-60', '60+', '60+', '35-60', '35-60', '60+', '0-35', '60+', '60+', '60+', '35-60', '60+', '60+', '60+', '0-35', '60+', '60+', '0-35', '0-35', '0-35', '60+', '0-35', '0-35', '0-35', '60+', '60+', '60+', '0-35', '60+', '60+', '0-35', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '35-60', '35-60', '60+', '60+', '35-60', '35-60', '0-35', '60+', '0-35', '35-60', '35-60', '35-60', '0-35', '35-60', '35-60', '60+', '60+', '60+', '0-35', '60+', '0-35', '0-35', '0-35', '0-35', '0-35', '0-35', '0-35', '0-35', '60+', '60+', '35-60', '0-35', '0-35', '0-35', '35-60', '35-60', '35-60', '60+', '0-35', '0-35', '35-60', '35-60', '35-60', '60+', '60+', '60+', '60+', '35-60', '0-35', '0-35', '60+', '60+', '35-60', '35-60', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '0-35', '0-35', '0-35', '35-60', '0-35', '60+', '0-35', '35-60', '60+', '60+', '0-35', '35-60', '35-60', '60+', '35-60', '0-35', '0-35', '0-35', '60+', '0-35', '60+', '0-35', '0-35', '60+', '0-35', '35-60', '35-60', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '60+', '0-35', '0-35', '60+', '60+', '0-35', '60+', '35-60', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '60+', '60+', '60+', '35-60', '35-60', '0-35', '60+', '60+', '35-60', '60+', '60+', '60+', '35-60', 
'0-35', '35-60', '60+', '60+', '35-60', '0-35', '60+', '0-35', '35-60', '35-60', '60+', '60+', '35-60', '0-35', '60+', '60+', '35-60', '35-60', '0-35', '35-60', '35-60', '0-35', '0-35', '60+', '0-35', '60+', '35-60', '35-60', '60+', '0-35', '60+', '60+', '35-60', '0-35', '0-35', '0-35', '60+', '0-35', '35-60', '35-60', '0-35', '0-35', '35-60', '35-60', '60+', '35-60', '60+', '60+', '0-35', '35-60', '0-35', '0-35', '35-60', '60+', '0-35', '35-60', '35-60', '60+', '60+', '35-60', '0-35', '0-35', '60+', '60+', '0-35', '0-35', '60+', '60+', '60+', '35-60', '0-35', '0-35', '0-35', '0-35', '60+', '60+', '0-35', '35-60', '60+', '0-35', '35-60', '60+', '60+', '35-60', '60+', '60+', '35-60', '0-35', '60+', '60+', '0-35', '35-60', '35-60', '35-60', '60+', '60+', '0-35', '60+', '35-60', '0-35', '0-35', '0-35', '35-60', '35-60', '60+', '0-35', '35-60', '0-35', '35-60', '35-60', '35-60', '60+', '0-35', '35-60', '60+', '60+', '60+', '35-60', '0-35', '0-35', '0-35', '0-35', '0-35', '60+', '60+', '35-60', '60+', '0-35', '35-60', '0-35', '60+')

sex <- c('F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 
'F', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F')
sex <- c(sex, 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 
'F', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'F')


thyroid$age <- age
thyroid$sex <- sex

# Print the column classes in thyroid
sapply(X = thyroid, FUN = class)
##       label         TSH          T3         TT4         T4U         FTI 
##   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##         TBG  binary_lof  binary_iso         age         sex 
##   "numeric"   "numeric"   "numeric" "character" "character"
# Convert column with character class to factor
thyroid$age <- as.factor(thyroid$age)
thyroid$sex <- as.factor(thyroid$sex)

# Check that all columns are factor or numeric
sapply(X = thyroid, FUN = class)
##      label        TSH         T3        TT4        T4U        FTI 
##  "numeric"  "numeric"  "numeric"  "numeric"  "numeric"  "numeric" 
##        TBG binary_lof binary_iso        age        sex 
##  "numeric"  "numeric"  "numeric"   "factor"   "factor"
# Check the class of age column
class(thyroid$age)
## [1] "factor"
# Check the class of sex column
class(thyroid$sex)
## [1] "factor"
# Fit an isolation forest with 100 trees
# thyroid_for <- iForest(thyroid[, -1], nt=100)


# Calculate Gower's distance matrix
thyroid_dist <- cluster::daisy(thyroid[, -1], metric = "gower")
## Warning in cluster::daisy(thyroid[, -1], metric = "gower"): binary
## variable(s) 7, 8 treated as interval scaled
# Generate LOF scores for thyroid data
thyroid_lof <- dbscan::lof(thyroid_dist, k = 10)

# Range of values in the distance matrix
range(as.matrix(thyroid_dist))
## [1] 0.0000000 0.6718958

GARCH Models in R

Chapter 1 - Standard GARCH Model as the Workhorse

Analyzing volatility:

  • GARCH models in R help maximize returns while properly managing risk
    • Relative gains are key; gain divided by starting point
  • Properties of daily returns include that average return is zero and variability changes over time
    • sd(sp500ret) # daily
    • sqrt(252)*sd(sp500ret) # annualized, assuming 252 trading days
  • Can use rolling windows to estimate how variability is changing over time - commonly, window is a multiple of 22 (trading days in a month)
    • library(PerformanceAnalytics)
    • chart.RollingPerformance(R = sp500ret, width = 22, FUN = "sd.annualized", scale = 252, main = "Rolling 1 month volatility")
  • The GARCH model helps drive greater precision and accuracy than the simple rolling window
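
The daily and annualized calculations can be reproduced on simulated returns; `rnorm` here is a stand-in for sp500ret, and 252 trading days per year is assumed.

```r
# Simulated daily returns as a stand-in for sp500ret
set.seed(123)
ret <- rnorm(1000, mean = 0, sd = 0.01)

daily_vol  <- sd(ret)                 # daily volatility
annual_vol <- sqrt(252) * daily_vol   # annualized, assuming 252 trading days

# Base-R rolling 22-day (one trading month) volatility, no extra packages
roll_vol <- sapply(22:length(ret), function(i) sd(ret[(i - 21):i]))
```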

GARCH equation for volatility prediction:

  • GARCH models attempt to model forward-looking volatility
    • Input is a time series of returns
    • May want to predict a future return based on all current and previous returns - arithmetic mean or ARMA
    • Can also predict the future volatility based on previous returns and volatilities - a weighted average of recent squared prediction errors, with more recent errors weighted more heavily
  • ARCH is the original model, while GARCH(1, 1) is its generalized form
    • omega, alpha, beta should all be positive
    • alpha plus beta must be less than 1 (meaning that variance always reverts to its long-run average)
  • Example of implementing GARCH variance using R (loop starting at 2 because of the lagged predictor)
    • alpha <- 0.1
    • beta <- 0.8
    • omega <- var(sp500ret)*(1-alpha-beta) # Then: var(sp500ret) = omega/(1-alpha-beta)
    • e <- sp500ret - mean(sp500ret) # Constant mean
    • e2 <- e^2
    • nobs <- length(sp500ret)
    • predvar <- rep(NA, nobs)
    • predvar[1] <- var(sp500ret) # Initialize the process at the sample variance
    • for (t in 2:nobs){
    • predvar[t] <- omega + alpha * e2[t - 1] + beta * predvar[t-1] # GARCH(1,1) equation
    • }
    • predvol <- sqrt(predvar) # Volatility is sqrt of predicted variance
    • predvol <- xts(predvol, order.by = time(sp500ret))
    • uncvol <- sqrt(omega / (1 - alpha-beta)) # We compare with the unconditional volatility
    • uncvol <- xts(rep(uncvol, nobs), order.by = time(sp500ret))
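
As a sanity check on the recursion above, one can simulate a GARCH(1,1) process with illustrative parameter values and verify that the sample variance lands near the unconditional variance omega / (1 - alpha - beta):

```r
# Simulate a GARCH(1,1) process in base R (illustrative parameters) and
# check that the sample variance is close to omega / (1 - alpha - beta).
set.seed(123)
omega <- 1e-5; alpha <- 0.1; beta <- 0.8
n <- 50000
h <- numeric(n); e <- numeric(n)
h[1] <- omega / (1 - alpha - beta)   # initialize at the unconditional variance
e[1] <- sqrt(h[1]) * rnorm(1)
for (t in 2:n) {
  h[t] <- omega + alpha * e[t - 1]^2 + beta * h[t - 1]  # GARCH(1,1) recursion
  e[t] <- sqrt(h[t]) * rnorm(1)
}
uncvar <- omega / (1 - alpha - beta)
c(sample = var(e), unconditional = uncvar)
```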

rugarch package:

  • The normal GARCH(1, 1) model with a constant mean is a starting point
    • Can use maximum likelihood to estimate mean, omega, alpha, and beta
    • library(rugarch)
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "sGARCH"), distribution.model = "norm") # first step is ugarchspec
    • garchfit <- ugarchfit(data = sp500ret, spec = garchspec) # ugarchfit is the second step
    • garchforecast <- ugarchforecast(fitORspec = garchfit, n.ahead = 5) # ugarchforecast is final step
  • Can access the results in a ugarchfit object
    • The ugarchfit yields an object that contains all the results related to the estimation of the garch model
    • Methods coef, uncvar, fitted and sigma:
    • garchcoef <- coef(garchfit)
    • garchuncvar <- uncvariance(garchfit)
    • garchmean <- fitted(garchfit)
    • garchvol <- sigma(garchfit)
    • sigma(garchforecast)
    • fitted(garchforecast)
  • Frequently, portfolios are set to a target volatility, with a portion of the portfolio held in cash to manage that

Example code includes:

library(xts)
library(PerformanceAnalytics)


load("./RInputFiles/sp500prices.RData")
str(sp500prices)


# Plot daily S&P 500 prices
plot(sp500prices)

# Compute daily returns
sp500ret <- CalculateReturns(sp500prices)

# Check the class of sp500ret
class(sp500ret)

# Plot daily returns
plot(sp500ret)


# Compute the daily standard deviation for the complete sample   
sd(sp500ret)

# Compute the annualized volatility for the complete sample
sd(sp500ret) * sqrt(252)

# Compute the annualized standard deviation for the year 2009 
sqrt(252) * sd(sp500ret["2009"])

# Compute the annualized standard deviation for the year 2017 
sqrt(252) * sd(sp500ret["2017"])


# Showing two plots on the same figure
par(mfrow=c(2,1)) 

# Compute the rolling 1 month estimate of annualized volatility
chart.RollingPerformance(R = sp500ret["2000::2017"], width = 22,
     FUN = "sd.annualized", scale = 252, main = "One month rolling volatility")

# Compute the rolling 3 months estimate of annualized volatility
chart.RollingPerformance(R = sp500ret["2000::2017"], width = 66,
     FUN = "sd.annualized", scale = 252, main = "Three months rolling volatility")

par(mfrow=c(1,1)) 


sp500ret <- sp500ret[2:length(sp500ret), ]

# Compute the mean daily return
m <- mean(sp500ret)

# Define the series of prediction errors
e <- sp500ret - m

# Plot the absolute value of the prediction errors
par(mfrow = c(2,1), mar = c(3, 2, 2, 2))
plot(abs(e))

# Plot the acf of the absolute prediction errors
acf(abs(e))
par(mfrow = c(1,1), mar = c(5.1, 4.1, 4.1, 2.1))


nobs <- length(sp500ret)
predvar <- numeric(nobs)
omega <- 1.2086e-05
alpha <- 0.1
beta <- 0.8
e2 <- e**2

# Compute the predicted variances
predvar[1] <- var(sp500ret) 
for(t in 2:nobs){
   predvar[t] <- omega + alpha * e2[t-1] + beta * predvar[t-1]
}

# Create annualized predicted volatility
ann_predvol <- xts(sqrt(predvar) * sqrt(252), order.by = time(sp500ret))

# Plot the annual predicted volatility in 2008 and 2009
plot(ann_predvol["2008::2009"], main = "Ann. S&P 500 vol in 2008-2009")


# Specify a standard GARCH model with constant mean
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0, 0)),
                 variance.model = list(model = "sGARCH"), 
                 distribution.model = "norm")

# Estimate the model
garchfit <- rugarch::ugarchfit(data = sp500ret, spec = garchspec)

# Use the method sigma to retrieve the estimated volatilities 
garchvol <- rugarch::sigma(garchfit) 

# Plot the volatility for 2017
plot(garchvol["2017"])


# Compute unconditional volatility
sqrt(rugarch::uncvariance(garchfit))

# Print last 10 ones in garchvol
tail(garchvol, 10)

# Forecast volatility 5 days ahead and add 
garchforecast <- rugarch::ugarchforecast(fitORspec = garchfit, n.ahead = 5)

# Extract the predicted volatilities and print them
print(rugarch::sigma(garchforecast))


# Compute the annualized volatility
annualvol <- sqrt(252) * rugarch::sigma(garchfit)

# Compute the 5% vol target weights  
vt_weights <- 0.05 / annualvol

# Compare the annualized volatility to the portfolio weights in a plot
plot(merge(annualvol, vt_weights), multi.panel = TRUE)

Chapter 2 - Improvements of the Normal GARCH Model

Non-normality of standardized returns:

  • Normal distributions are generally inconsistent with stock markets - “stairs up and elevators down”
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "sGARCH"), distribution.model = "sstd")
    • Caveat: The normality of the standardized returns follows from an assumption
    • Let's compute the standardized returns and test whether the assumption is correct.
    • stdret <- residuals(garchfit, standardize = TRUE)
    • chart.Histogram(stdret, methods = c("add.normal", "add.density"), colorset = c("gray", "red", "blue"))
  • A more realistic distribution needs fat tails and skew
    • The shape parameter (nu) determines the fatness of the tails
    • A skew parameter of 1 means symmetry, while a value below 1 means negative skew
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "sGARCH"), distribution.model = "sstd")

Leverage effect:

  • Volatility predictions are unsigned, though the sign of the error matters (large negative returns drive more volatility than large positive returns)
    • In the case of a negative surprise, it is common to scale up the squared prediction error - alpha becomes (alpha + gamma), with gamma being positive
    • The GJR model was proposed by Glosten, Jagannathan and Runkle.
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = msftret, spec = garchspec)
    • out <- newsimpact(garchfit)
    • plot(out$zx, out$zy, xlab = "prediction error", ylab = "predicted variance")
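
The asymmetry can also be traced by hand. The sketch below builds a GJR-style news impact curve with illustrative (not fitted) parameter values, holding the lagged variance at its long-run level:

```r
# News impact curve for a GJR GARCH model, base R sketch with
# illustrative (not fitted) parameters.
omega <- 1e-6; alpha <- 0.05; gamma <- 0.10; beta <- 0.85
h_bar <- omega / (1 - alpha - gamma / 2 - beta)  # long-run variance level
e <- seq(-0.05, 0.05, by = 0.001)                # grid of prediction errors
# Negative surprises get the extra gamma term:
h_next <- omega + (alpha + gamma * (e < 0)) * e^2 + beta * h_bar
plot(e, h_next, type = "l", xlab = "prediction error", ylab = "predicted variance")
```

A negative error of a given size raises predicted variance more than a positive error of the same size, which is exactly what newsimpact() visualizes for a fitted model.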

Mean model:

  • Higher returns come with greater risks; can quantify these using GARCH mean models
    • The lambda parameter is typically positive, relating increases in volatility to increases in expected return
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0), archm = TRUE, archpow = 2), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = sp500ret, spec = garchspec)
    • plot(fitted(garchfit))
  • The AR(1) model is a common approach for modeling autocorrelation in returns - the AR coefficient can be negative or positive
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = sp500ret, spec = garchspec)
    • round(coef(garchfit)[1:2], 4)
  • The MA(1) and ARMA(1, 1) are also popular models
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,1)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,1)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
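
The AR(1) mean dynamics can be illustrated without rugarch by fitting stats::arima on simulated data:

```r
# Fit an AR(1) mean model with base R (stats::arima) on simulated returns;
# this sketches the mean dynamics only, without the GARCH variance part.
set.seed(1)
ret <- arima.sim(model = list(ar = 0.3), n = 2000, sd = 0.01)
fit <- arima(ret, order = c(1, 0, 0))
round(coef(fit), 3)  # the ar1 estimate should land near the true value 0.3
```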

Avoid unnecessary complexity:

  • Complexity has a price including loss of parsimony; avoid unneeded complexity
  • Can use setfixed() to fix parameter values, or setbounds() to constrain the parameter search space
    • setfixed(garchspec) <- list(alpha1 = 0.05, shape = 6)
    • garchfit <- ugarchfit(data = EURUSDret, spec = garchspec)
    • setbounds(garchspec) <- list("alpha1" = c(0.05, 0.2), "beta1" = c(0.8, 0.95))
  • Can also use variance targeting
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "sGARCH", variance.targeting = TRUE), distribution.model = "std")
    • garchfit <- ugarchfit(data = EURUSDret, spec = garchspec)

Example code includes:

load("./RInputFiles/ret.RData")
str(ret)


# Plot the return series
plot(ret)

# Specify the garch model to be used
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0)),
                                 variance.model = list(model = "sGARCH"),
                                 distribution.model = "sstd"
                                 )

# Estimate the model
garchfit <- rugarch::ugarchfit(data = ret, spec = garchspec)

# Inspect the coefficients
rugarch::coef(garchfit)


# Compute the standardized returns
stdret <- rugarch::residuals(garchfit, standardize = TRUE)

# Compute the standardized returns using fitted() and sigma()
stdret <- (ret - rugarch::fitted(garchfit)) / rugarch::sigma(garchfit)

# Load the package PerformanceAnalytics and make the histogram
chart.Histogram(stdret, methods = c("add.normal","add.density" ), colorset = c("gray","red","blue"))


load("./RInputFiles/msftret.RData")
str(msftret)


# Specify the GJR GARCH model
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0)),
                                 variance.model = list(model = "gjrGARCH"),
                                 distribution.model = "sstd"
                                 )

# Estimate the model and compute volatility
gjrgarchfit <- rugarch::ugarchfit(data = msftret, spec = garchspec)
gjrgarchvol <- rugarch::sigma(gjrgarchfit)

# Compare volatility
plotvol <- plot(abs(msftret), col = "grey")
plotvol <- addSeries(gjrgarchvol, col = "red", on=1)
# plotvol <- addSeries(sgarchvol, col = "blue", on=1)
plotvol


# Specify AR(1)-GJR GARCH model
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1, 0)),
                                 variance.model = list(model = "gjrGARCH"),
                                 distribution.model = "sstd"
                                 )

# Estimate the model
garchfit <- rugarch::ugarchfit(data=msftret, spec=garchspec)

# Print the first two coefficients
rugarch::coef(garchfit)[c(1:2)]


# GARCH-in-Mean specification and estimation
gim_garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0), archm = TRUE, archpow = 2),
                                     variance.model = list(model = "gjrGARCH"), 
                                     distribution.model = "sstd"
                                     )
gim_garchfit <- rugarch::ugarchfit(data = msftret , spec = gim_garchspec)

# Predicted mean returns and volatility of GARCH-in-mean
gim_mean <- rugarch::fitted(gim_garchfit)
gim_vol <- rugarch::sigma(gim_garchfit)


ar1_garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0), archm = TRUE, archpow = 2),
                                     variance.model = list(model = "sGARCH"), 
                                     distribution.model = "sstd"
                                     )
ar1_garchfit <- rugarch::ugarchfit(data = msftret , spec = ar1_garchspec)
ar1_mean <- rugarch::fitted(ar1_garchfit)
ar1_vol <- rugarch::sigma(ar1_garchfit)

cmu_garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0), archm = TRUE, archpow = 2),
                                     variance.model = list(model = "sGARCH"), 
                                     distribution.model = "sstd"
                                     )
cmu_garchfit <- rugarch::ugarchfit(data = msftret , spec = cmu_garchspec)
constmean_mean <- rugarch::fitted(cmu_garchfit)
constmean_vol <- rugarch::sigma(cmu_garchfit)

# Correlation between predicted return using AR(1) and GARCH-in-mean models
cor(ar1_mean, gim_mean)

# Correlation between predicted volatilities across mean.models
cor(merge(constmean_vol, ar1_vol, gim_vol))


load("./RInputFiles/EURUSDret.RData")
str(EURUSDret)


flexgarchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0), archm = FALSE),
                                     variance.model = list(model = "sGARCH"), 
                                     distribution.model = "sstd"
                                     )
flexgarchfit <- rugarch::ugarchfit(data = EURUSDret , spec = flexgarchspec)


# Print the flexible GARCH parameters
rugarch::coef(flexgarchfit)

# Restrict the flexible GARCH model by imposing fixed ar1 and skew parameters
rflexgarchspec <- flexgarchspec
rugarch::setfixed(rflexgarchspec) <- list(ar1 = 0, skew = 1)

# Estimate the restricted GARCH model
rflexgarchfit <- rugarch::ugarchfit(data = EURUSDret,  spec = rflexgarchspec)

# Compare the volatility of the unrestricted and restricted GARCH models
plotvol <- plot(abs(EURUSDret), col = "grey")
plotvol <- addSeries(rugarch::sigma(flexgarchfit), col = "black", lwd = 4, on=1 )
plotvol <- addSeries(rugarch::sigma(rflexgarchfit), col = "red", on=1)
plotvol


# Define bflexgarchspec as the bound constrained version
bflexgarchspec <- flexgarchspec
rugarch::setbounds(bflexgarchspec) <- list(alpha1 = c(0.05, 0.2), beta1 = c(0.8, 0.95))

# Estimate the bound constrained model
bflexgarchfit <- rugarch::ugarchfit(data = EURUSDret, spec = bflexgarchspec)

# Inspect coefficients
rugarch::coef(bflexgarchfit)

# Compare forecasts for the next ten days
cbind(rugarch::sigma(rugarch::ugarchforecast(flexgarchfit, n.ahead = 10)),
      rugarch::sigma(rugarch::ugarchforecast(bflexgarchfit, n.ahead = 10))
      )


# Complete the specification to do variance targeting
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0)),
                                 variance.model = list(model = "sGARCH", variance.targeting = TRUE),
                                 distribution.model = "std"
                                 )

# Estimate the model
garchfit <- rugarch::ugarchfit(data = EURUSDret, spec = garchspec)

# Print the GARCH model implied long run volatility
sqrt(rugarch::uncvariance(garchfit))

# Verify that it equals the standard deviation (after rounding)
all.equal(sqrt(rugarch::uncvariance(garchfit)), sd(EURUSDret), tol = 1e-4)

Chapter 3 - Performance Evaluation

Statistical Significance:

  • The model should only use meaningful and significant variables
    • Can simplify by removing non-significant parameters
    • Since we do not know the true parameters, we first estimate all of them and then test which are statistically indistinguishable from zero
    • round(coef(flexgarchfit), 6)
  • Tests for statistical significance are available in the rugarch package - a general rule of thumb is to keep a parameter if abs(t-value) >= 2
    • flexgarchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • flexgarchfit <- ugarchfit(data = msftret, spec = flexgarchspec)
    • round(flexgarchfit@fit$matcoef, 6)

Goodness of Fit:

  • Evaluation criteria depend on the elements that you want to assess
  • Goodness of fit for mean predictions
    • e <- residuals(tgarchfit)
    • mean(e^2)
  • Goodness of fit for variance predictions
    • e <- residuals(tgarchfit)
    • d <- e^2 - sigma(tgarchfit)^2
    • mean(d^2)
  • Goodness of fit for the distribution - the higher the density at an observed return, the more likely that return is under the estimated GARCH model
    • The likelihood of the sample is the product of all these densities. It measures how likely it is that the observed returns come from the estimated GARCH model
    • The higher the likelihood, the better the model fits with your data
    • likelihood(tgarchfit) # compare with likelihood achieved using other models; higher likelihood is better
  • There is a risk of over-fitting since the in-sample data is used for the goodness of fit tests - should add penalties for non-parsimonious models
    • information criteria = - likelihood + penalty(number of parameters)
    • infocriteria(tgarchfit)
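
The trade-off behind infocriteria() can be written out by hand. The numbers below are hypothetical, and the per-observation scaling matches how rugarch reports its criteria:

```r
# Manual sketch of information criteria: reward fit, penalize parameters.
# The log-likelihood, k, and n below are hypothetical illustrative numbers.
loglik <- 3500   # hypothetical log-likelihood of a fitted model
k <- 5           # number of estimated parameters
n <- 2500        # number of observations
aic <- (-2 * loglik + 2 * k) / n        # per-observation Akaike criterion
bic <- (-2 * loglik + log(n) * k) / n   # BIC penalizes parameters more for large n
c(AIC = aic, BIC = bic)                 # lower is better
```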

Diagnosing Absolute Standardized Returns:

  • Can check the standardized returns - scaled for mean 0 and standard deviation 1
    • Standardized returns should have constant variance over time, provided the model is making good estimates of mean and sigma
    • There should also be no autocorrelation in the standardized returns over time
  • Example application to the daily MSFT returns
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = msftret, spec = garchspec)
    • stdmsftret <- residuals(garchfit, standardize = TRUE)
    • acf(abs(msftret), 22)
    • acf(abs(stdmsftret), 22)
  • Can check the autocorrelations using the Ljung-Box test - we want p to be GREATER than 5% (no significant autocorrelation found)
    • Box.test(abs(stdmsftret), 22, type = "Ljung-Box")
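
Box.test() is base R, so the contrast is easy to demonstrate on simulated data: i.i.d. noise versus a series with two variance regimes standing in for volatility clustering.

```r
# Ljung-Box on absolute values: i.i.d. noise vs. a series with volatility
# clustering (two variance regimes). Simulated data, not the course data.
set.seed(7)
iid   <- rnorm(2000)
clust <- rnorm(2000) * rep(c(0.5, 2), each = 1000)  # low-vol then high-vol regime
p_iid   <- Box.test(abs(iid),   lag = 22, type = "Ljung-Box")$p.value
p_clust <- Box.test(abs(clust), lag = 22, type = "Ljung-Box")$p.value
c(iid = p_iid, clustered = p_clust)  # clustering should drive its p-value toward 0
```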

Back-testing using ugarchroll:

  • Can use rolling estimation to avoid look-ahead bias
    • tgarchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "sGARCH"), distribution.model = "std")
    • garchroll <- ugarchroll(tgarchspec, data = EURUSDret, n.start = 2500, refit.window = "moving", refit.every = 500)
  • There are five arguments to specify
    • GARCH specification used
    • data : return data to use
    • n.start: the size of the initial estimation sample
    • refit.window: how to change that sample through time: "moving" or "expanding"
    • refit.every: how often to re-estimate the model
  • Can convert results to data frames
    • preds <- as.data.frame(garchroll)
    • preds$Mu: series of predicted mean values
    • preds$Sigma: series of predicted volatility values
    • garchvol <- xts(preds$Sigma, order.by = as.Date(rownames(preds)))
    • plot(garchvol)
    • preds <- as.data.frame(garchroll)
    • Evaluate accuracy of preds$Mu and preds$Sigma by comparing with preds$Realized
    • e <- preds$Realized - preds$Mu
    • mean(e^2)
    • d <- e^2 - preds$Sigma^2
    • mean(d^2)
  • Can run a comparison of two models
    • tgarchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "sGARCH"), distribution.model = "std")
    • garchroll <- ugarchroll(tgarchspec, data = EURUSDret, n.start = 2500, refit.window = "moving", refit.every = 500)
    • gjrgarchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • gjrgarchroll <- ugarchroll(gjrgarchspec, data = EURUSDret, n.start = 2500, refit.window = "moving", refit.every = 500)

Example code includes:

# Specify model with AR(1) dynamics, GJR GARCH and skewed student t
flexgarchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0)),
                                     variance.model = list(model = "gjrGARCH"),
                                     distribution.model = "sstd"
                                     )

# Estimate the model
flexgarchfit <- rugarch::ugarchfit(data = EURUSDret, spec = flexgarchspec)

# Complete and study the statistical significance of the estimated parameters  
round(flexgarchfit@fit$matcoef, 6)


# Specify model with constant mean, standard GARCH and student t
tgarchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1, 0)),
                                  variance.model = list(model = "sGARCH", variance.targeting = TRUE),
                                  distribution.model = "sstd"
                                  )

# Fix the mu parameter at zero
rugarch::setfixed(tgarchspec) <- list("mu" = 0)

# Estimate the model
tgarchfit <- rugarch::ugarchfit(data = EURUSDret, spec = tgarchspec)

# Verify that the differences in volatility are small
plot(rugarch::sigma(tgarchfit) - rugarch::sigma(flexgarchfit))


# Compute prediction errors
garcherrors <- rugarch::residuals(flexgarchfit)
gjrerrors  <- rugarch::residuals(tgarchfit)

# Compute MSE for variance prediction of garchfit model
mean((rugarch::sigma(flexgarchfit)**2 - garcherrors^2)**2)

# Compute MSE for variance prediction of gjrfit model
mean((rugarch::sigma(tgarchfit)**2 - gjrerrors^2)**2)


# Print the number of estimated parameters
length(rugarch::coef(flexgarchfit))
length(rugarch::coef(tgarchfit))

# Print likelihood of the two models
rugarch::likelihood(flexgarchfit)
rugarch::likelihood(tgarchfit)

# Print the information criteria of the two models
rugarch::infocriteria(flexgarchfit)
rugarch::infocriteria(tgarchfit)


# Compute the standardized returns
stdEURUSDret <- rugarch::residuals(tgarchfit, standardize = TRUE)

# Compute their sample mean and standard deviation
mean(stdEURUSDret)
sd(stdEURUSDret)

# Correlogram of the absolute (standardized) returns
par(mfrow = c(1, 2))
acf(abs(EURUSDret), 22)
acf(abs(stdEURUSDret), 22)
par(mfrow = c(1, 1))

# Ljung-Box test
Box.test(abs(stdEURUSDret), 22, type = "Ljung-Box")


# Estimate the model on the last 2500 observations
tgarchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0,0)),
                                  variance.model = list(model = "sGARCH"),
                                  distribution.model = "std"
                                  )
tgarchfit <- rugarch::ugarchfit(data = tail(EURUSDret, 2500) , spec = tgarchspec)

# Compute standardized returns
stdEURUSDret <- rugarch::residuals(tgarchfit, standardize = TRUE)

# Do the Ljung-Box test on the absolute standardized returns
Box.test(abs(stdEURUSDret), 22, type = "Ljung-Box")


# Estimate the GARCH model using all the returns and compute the in-sample estimates of volatility
garchinsample <- rugarch::ugarchfit(data = sp500ret, spec = flexgarchspec)
garchvolinsample <- rugarch::sigma(garchinsample)

# Use ugarchroll for rolling estimation of the GARCH model 
garchroll <- rugarch::ugarchroll(flexgarchspec, data = sp500ret, 
                                 n.start = 2000, refit.window = "moving",  refit.every = 2500
                                 )

# Set preds to the data frame with rolling predictions
preds <- rugarch::as.data.frame(garchroll)

# Compare in-sample and rolling sample volatility in one plot
garchvolroll <- xts(preds$Sigma, order.by = as.Date(rownames(preds)))
volplot <- plot(garchvolinsample, col = "darkgrey", lwd = 1.5, 
                main = "In-sample versus rolling vol forecasts"
                )
volplot <- addSeries(garchvolroll, col = "blue", on = 1)
plot(volplot)


# Inspect the first three rows of the dataframe with out of sample predictions
head(preds, 3)

# Compute prediction errors
e <- preds$Realized - preds$Mu  
d <- e^2 - preds$Sigma^2 

# Compute MSE for the garchroll variance prediction
garchMSE <- mean(d^2)

# Use ugarchroll for rolling estimation of the GARCH model 
gjrgarchroll <- rugarch::ugarchroll(tgarchspec, data = sp500ret, 
                                    n.start = 2000, refit.window = "moving",  refit.every = 2500
                                    )

# Compute MSE for gjrgarchroll
gjrgarchpreds <- rugarch::as.data.frame(gjrgarchroll)
e <- gjrgarchpreds$Realized - gjrgarchpreds$Mu  
d <- e^2 - gjrgarchpreds$Sigma^2 
gjrgarchMSE <- mean(d**2)

Chapter 4 - Applications

Value at Risk:

  • A common metric is the 5% VaR, which is the amount that would be lost in the “best case” of the bottom 5% of returns over a specified period of time (such as a year)
    • Target is to predict the VaR forward rather than assess it backwards
  • Workflow for assessing VaR in R
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchroll <- ugarchroll(garchspec, data = sp500ret, n.start = 2500, refit.window = "moving", refit.every = 100)
    • garchVaR <- quantile(garchroll, probs = 0.05)
    • actual <- xts(as.data.frame(garchroll)$Realized, time(garchVaR))
    • VaRplot(alpha = 0.05, actual = actual, VaR = garchVaR)
  • The VaR exceedance rate should be 5% (when the VaR is set at the 5% quantile) - this is called the “coverage” of the VaR
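
The coverage idea can be checked on simulated normal returns, where the true 5% quantile is known exactly:

```r
# Coverage sketch with simulated normal returns: for a correctly calibrated
# 5% VaR, roughly 5% of realized returns should fall below it.
set.seed(99)
ret <- rnorm(10000, mean = 0, sd = 0.01)   # hypothetical daily returns
VaR <- qnorm(0.05, mean = 0, sd = 0.01)    # the true 5% quantile
coverage <- mean(ret < VaR)
coverage  # should be close to 0.05
```

In a backtest, coverage well above 5% suggests the model understates risk; well below 5% suggests it is too conservative.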

Production and Simulation:

  • Data used for modeling will differ from the data seen in production
    • Use ugarchfilter() for analyzing the recent dynamics in the mean and volatility
    • Use ugarchforecast() applied to a ugarchspec object (instead of a ugarchfit object) for making the predictions about the future mean and volatility
  • Example of running the full process
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = msftret["/2010-12"], spec = garchspec)
    • progarchspec <- garchspec
    • setfixed(progarchspec) <- as.list(coef(garchfit))
    • garchfilter <- ugarchfilter(data = msftret, spec = progarchspec)
    • plot(sigma(garchfilter))
    • garchforecast <- ugarchforecast(data = msftret, fitORspec = progarchspec, n.ahead = 10)
    • cbind(fitted(garchforecast), sigma(garchforecast))
  • Can use the model to simulate log returns - useful to assess randomness in future returns
    • msftlogret <- diff(log(MSFTprice))[-1]
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = msftlogret, spec = garchspec)
    • simgarchspec <- garchspec
    • setfixed(simgarchspec) <- as.list(coef(garchfit))
  • Can run simulations using ugarchpath()
    • spec : completely specified GARCH model
    • m.sim : number of time series of simulated returns you want
    • n.sim: number of observations in the simulated time series (e.g. 252)
    • rseed : any number to fix seed used to generate the simulated series (needed for reproducibility)
    • simgarch <- ugarchpath(spec = simgarchspec, m.sim = 4, n.sim = 10 * 252, rseed = 12345)
    • simret <- fitted(simgarch)
    • plot.zoo(simret)
    • plot.zoo(sigma(simgarch))
    • simprices <- exp(apply(simret, 2, "cumsum"))
    • matplot(simprices, type = "l", lwd = 3)

Model Risk:

  • Model averaging is often more valuable than a single point estimate from a single model
    • Sensitivity analysis and outlier removal can be valuable
    • variance.models <- c("sGARCH", "gjrGARCH")
    • distribution.models <- c("norm", "std", "sstd")
    • c <- 1
    • for (variance.model in variance.models) {
    • for (distribution.model in distribution.models) {
    •   garchspec <- ugarchspec(mean.model = list(armaOrder = c(0, 0)), variance.model = list(model = variance.model), distribution.model = distribution.model)  
    •   garchfit <- ugarchfit(data = msftret, spec = garchspec)  
    •   if (c==1) { msigma <- sigma(garchfit) } else { msigma <- merge(msigma, sigma(garchfit)) }
    •   c <- c + 1  
    • }
    • }
    • avesigma <- xts(rowMeans(msigma), order.by = time(msigma))
  • Robustness to starting values - additional source of risk to the GARCH models
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(0,0)), variance.model = list(model = "sGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = sp500ret, spec = garchspec)
    • coef(garchfit)
    • likelihood(garchfit)
    • setstart(garchspec) <- list(alpha1 = 0.05, beta1 = 0.9, shape = 8)
    • garchfit <- ugarchfit(data = sp500ret, spec = garchspec)
  • Outliers in the underlying return data are an additional source of potential error
    • library(PerformanceAnalytics)
    • clmsftret <- Return.clean(msftret, method = "boudt")
    • plotret <- plot(msftret, col = "red")
    • plotret <- addSeries(clmsftret, col = "blue", on = 1)
    • garchspec <- ugarchspec(mean.model = list(armaOrder = c(1,0)), variance.model = list(model = "gjrGARCH"), distribution.model = "sstd")
    • garchfit <- ugarchfit(data = msftret, spec = garchspec)
    • clgarchfit <- ugarchfit(data = clmsftret, spec = garchspec)
    • plotvol <- plot(abs(msftret), col = "gray")
    • plotvol <- addSeries(sigma(garchfit), col = "red", on = 1)
    • plotvol <- addSeries(sigma(clgarchfit), col = "blue", on = 1)
    • plotvol

GARCH Covariance:

  • Covariances of asset returns may vary over time
  • GARCH covariance can be estimated in four steps
    • Step 1: Use ugarchfit() to estimate the GARCH model for each return series.
    • msftgarchfit <- ugarchfit(data = msftret, spec = garchspec)
    • wmtgarchfit <- ugarchfit(data = wmtret, spec = garchspec)
    • Step 2: Use residuals() to compute the standardized returns.
    • stdmsftret <- residuals(msftgarchfit, standardize = TRUE)
    • stdwmtret <- residuals(wmtgarchfit, standardize = TRUE)
    • Step 3: Use cor() to estimate rho as the sample correlation of the standardized returns.
    • msftwmtcor <- as.numeric(cor(stdmsftret, stdwmtret))
    • msftwmtcor
    • Step 4: Compute the GARCH covariance by multiplying the estimated correlation and volatilities
    • msftwmtcov <- msftwmtcor * sigma(msftgarchfit) * sigma(wmtgarchfit)
  • Covariance has many applications in finance
    • Optimization of portfolio variance
    • msftvar <- sigma(msftgarchfit)^2
    • wmtvar <- sigma(wmtgarchfit)^2
    • msftwmtcov <- msftwmtcor * sigma(msftgarchfit) * sigma(wmtgarchfit)
    • msftweight <- (wmtvar - msftwmtcov) / (msftvar + wmtvar - 2 * msftwmtcov)
    • Dynamic beta (systematic risk) of a specific stock
    • msftsp500cor <- as.numeric(cor(stdmsftret, stdsp500ret))
    • msftsp500cov <- msftsp500cor * sigma(msftgarchfit) * sigma(sp500garchfit)
    • sp500var <- sigma(sp500garchfit)^2
    • msftbeta <- msftsp500cov / sp500var
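
Plugging illustrative numbers into the two formulas above makes the arithmetic concrete; the volatilities and correlations here are made up for the example, not estimated.

```r
# Minimum variance weight for a two-asset (MSFT/WMT) portfolio and a
# dynamic beta vs. the S&P 500, with made-up illustrative inputs.
vol_msft <- 0.015; vol_wmt <- 0.010; rho <- 0.3
var_msft <- vol_msft^2; var_wmt <- vol_wmt^2
cov_mw <- rho * vol_msft * vol_wmt
w_msft <- (var_wmt - cov_mw) / (var_msft + var_wmt - 2 * cov_mw)

vol_sp500 <- 0.008; rho_m <- 0.8
beta_msft <- (rho_m * vol_msft * vol_sp500) / vol_sp500^2  # = rho_m * vol_msft / vol_sp500
c(weight_msft = w_msft, beta_msft = beta_msft)
```

In the GARCH version, the volatilities vary day by day while the correlation is held fixed, so the weight and beta become time series.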

Wrap Up:

  • Language of GARCH models
    • Volatility and volatility clustering
    • Information set and predictions
    • Leverage effect and GJR GARCH
    • Skewness, fat tails, and student t
    • Ljung-Box test, MSE, and other model validation methods
  • Language of rugarch
    • ugarchspec()
    • ugarchfit()
    • ugarchroll()
    • ugarchforecast()
    • ugarchfilter()
    • ugarchpath()
    • Many useful methods sigma(), fitted(), coef(), infocriteria(), likelihood(), setfixed(), setbounds(), quantile()…

Example code includes:

flexgarchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0)),
                                     variance.model = list(model = "gjrGARCH"),
                                     distribution.model = "sstd"
                                     )
garchroll <- rugarch::ugarchroll(flexgarchspec, data = msftret, 
                                 n.start = 2000, refit.window = "moving",  refit.every = 2500
                                 )

# Extract the dataframe with predictions from the rolling GARCH estimation
garchpreds <- rugarch::as.data.frame(garchroll)

# Extract the 5% VaR 
garchVaR <- rugarch::quantile(garchroll, probs = 0.05)

# Extract the volatility from garchpreds
garchvol <- xts(garchpreds$Sigma, order.by = time(garchVaR))

# Analyze the comovement in a time series plot
garchplot <- plot(garchvol, ylim = c(-0.1, 0.1))
garchplot <- addSeries(garchVaR, on = 1, col = "blue")
plot(garchplot, main = "Daily vol and 5% VaR")


# Take a default specification with a normal and a skewed student t distribution
normgarchspec <- rugarch::ugarchspec(distribution.model = "norm")
sstdgarchspec <- rugarch::ugarchspec(distribution.model = "sstd")

# Do rolling estimation
normgarchroll <- rugarch::ugarchroll(normgarchspec, data = msftret, n.start = 2500, 
                                     refit.window = "moving", refit.every = 2000
                                     )
sstdgarchroll <- rugarch::ugarchroll(sstdgarchspec, data = msftret, n.start = 2500, 
                                     refit.window = "moving", refit.every = 2000
                                     )

# Compute the 5% value at risk
normgarchVaR <- rugarch::quantile(normgarchroll, probs = 0.05)
sstdgarchVaR <- rugarch::quantile(sstdgarchroll, probs = 0.05)

# Compute the coverage
actual <- xts(rugarch::as.data.frame(normgarchroll)$Realized, time(normgarchVaR))
mean(actual < normgarchVaR)
mean(actual < sstdgarchVaR)


garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0)), 
                                 variance.model = list(model = "gjrGARCH"), 
                                 distribution.model = "sstd"
                                 )

# Estimate the model
garchfit <- rugarch::ugarchfit(data = sp500ret["/2006-12"], spec = garchspec)

# Fix the parameters
progarchspec <- garchspec
rugarch::setfixed(progarchspec) <- as.list(rugarch::coef(garchfit))

# Use ugarchfilter to obtain the estimated volatility for the complete period
garchfilter <- rugarch::ugarchfilter(data = sp500ret, spec = progarchspec)
plot(rugarch::sigma(garchfilter))

# Compare the 252 days ahead forecasts made at the end of September 2008 and September 2017
garchforecast2008 <- rugarch::ugarchforecast(data = sp500ret["/2008-09"], 
                                             fitORspec = progarchspec, n.ahead = 252
                                             )
garchforecast2017 <- rugarch::ugarchforecast(data = sp500ret["/2017-09"], 
                                             fitORspec = progarchspec, n.ahead = 252
                                             )
par(mfrow = c(2, 1), mar = c(3, 2, 3, 2))
plot(rugarch::sigma(garchforecast2008), main = "/2008-09", type = "l")
plot(rugarch::sigma(garchforecast2017), main = "/2017-09", type = "l")
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))


simgarchspec <- garchspec
rugarch::setfixed(simgarchspec) <- as.list(rugarch::coef(garchfit))

# Complete the code to simulate 4 time series of 10 years of daily returns
simgarch <- rugarch::ugarchpath(spec=simgarchspec, m.sim = 4, n.sim = 10*252, rseed = 210) 

# Plot the simulated returns of the four series
simret <- rugarch::fitted(simgarch)
plot.zoo(simret)
plot.zoo(rugarch::sigma(simgarch))

# Compute the corresponding simulated prices and plot them
simprices <- exp(apply(simret, 2, "cumsum"))
matplot(simprices, type = "l", lwd = 3)


# Specify model with constant mean, standard GARCH and student t
garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(0, 0)),
                                 variance.model = list(model = "sGARCH", variance.targeting = FALSE),
                                 distribution.model = "std"
                                 )

# Estimate using default starting values
garchfit <- rugarch::ugarchfit(spec=garchspec, data=EURUSDret)

# Print the estimated parameters and the likelihood
rugarch::coef(garchfit)
rugarch::likelihood(garchfit)

# Set other starting values and re-estimate
rugarch::setstart(garchspec) <- list(alpha1 = 0.05, beta1 = 0.9, shape = 6) 
garchfit <- rugarch::ugarchfit(spec=garchspec, data=EURUSDret)

# Print the estimated parameters and the likelihood
rugarch::coef(garchfit)
rugarch::likelihood(garchfit)


garchspec <- rugarch::ugarchspec(mean.model = list(armaOrder = c(1,0)), 
                                 variance.model = list(model = "gjrGARCH"), 
                                 distribution.model = "sstd"
                                 )
usgarchfit <- rugarch::ugarchfit(spec=garchspec, data=ret["2009/2017"])
eugarchfit <- rugarch::ugarchfit(spec=garchspec, data=msftret["2009/2017"])

# Compute the standardized US and EU returns, together with their correlation 
stdusret <- rugarch::residuals(usgarchfit, standardize = TRUE)
stdeuret <- rugarch::residuals(eugarchfit, standardize = TRUE)

useucor <- as.numeric(cor(stdusret, stdeuret))
print(useucor)

# Compute the covariance and variance of the US and EU returns 
useucov <- useucor * rugarch::sigma(usgarchfit) * rugarch::sigma(eugarchfit)
usvar <- rugarch::sigma(usgarchfit)**2
euvar <- rugarch::sigma(eugarchfit)**2

# Compute the minimum variance weight of the US ETF in the US-EU ETF portfolio 
usweight <- (euvar - useucov) / (usvar + euvar - 2*useucov)
plot(usweight)
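
As a sanity check on the minimum-variance weight formula used above, here is a small base-R sketch with made-up (hypothetical) variance and covariance numbers, verified against a numerical minimizer:

```r
# Hypothetical one-period variances and covariance (not GARCH output)
usvar <- 0.04; euvar <- 0.09; useucov <- 0.02

# Closed-form minimum-variance weight of the US asset
usweight <- (euvar - useucov) / (usvar + euvar - 2 * useucov)
usweight  # 0.7778 (approximately)

# Numerical check: minimize the two-asset portfolio variance over w
portvar <- function(w) w^2 * usvar + (1 - w)^2 * euvar + 2 * w * (1 - w) * useucov
optimize(portvar, interval = c(0, 1))$minimum  # matches the closed form
```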


# Compute standardized returns
# stdmsftret <- residuals(msftgarchfit, standardize=TRUE)
# stdwmtret <- residuals(wmtgarchfit, standardize=TRUE)

# Print the correlation
# cor(stdmsftret, stdwmtret)

# Load the package PerformanceAnalytics
# library(PerformanceAnalytics)

# Plot the 3-month rolling correlation
# chart.RollingCorrelation(stdmsftret, stdwmtret, width = 66, main = "3-month rolling correlation between MSFT and WMT daily returns")

RNA-Seq Differential Expression Analysis

Chapter 1 - Introduction to RNA-Seq Theory and Workflow

Introduction to RNA-Seq:

  • Discovery of genes that are differentially expressed between groups
  • The genome contains the instructions for life - double-stranded DNA with chromosomes built of nucleotides (G-A-C-T)
    • A-T and G-C nucleotides are paired up; the order is called the DNA sequence
    • Within the sequence, there are genes, which create messenger RNA which produces proteins
    • To produce proteins, DNA is transcribed into pre-mRNA, processed into mature mRNA, and then translated into protein (DNA -> pre-mRNA -> mature mRNA -> protein)
  • Muscle cells, nerve cells, and other cells activate different portions of the DNA
    • Mutations can affect the types and quantities of RNA and proteins that are produced
  • Differential expression analysis looks at differences in gene expression across groups, over time, correlated with other changes, driven by what pathways, etc.

RNA-Seq Workflow:

  • Experiment planning is important
    • Technical replicates: Generally low technical variation, so unnecessary
    • Biological replicates: Crucial to the success of RNA-Seq differential expression analyses. The more replicates the better, but at the very least have 3
    • Batch effects: Avoid as much as possible and note down all experimental variables
  • Biological sample preparation - isolate RNA, generate cDNA, FASTQ sequencing, RNA reads (sequence ends)
    • Followed by mapping to the genome (exon and intron alignment is needed)
    • Counting reads associated with genes - drives estimates of gene counts
  • Can run the process using Bioconductor
    • wt_rawcounts <- read.csv("fibrosis_wt_rawcounts.csv")

Differential Gene Expression Theory:

  • Differential expression is based on normalized counts, adjusted for variation in the data
    • Are the differences between groups significant?
  • Course will use a publicly available dataset on SMOC2 overexpression in relation to kidney fibrosis
  • Often, raw count distributions have a long right tail - Poisson might be a good model if there were no biological variation
    • ggplot(raw_counts) + geom_histogram(aes(x = wt_normal1), stat = "bin", bins = 200) + xlab("Raw expression counts") + ylab("Number of genes")
  • The negative binomial model works better for the real-world data
    • wt_rawcounts <- read.csv("fibrosis_wt_rawcounts.csv")
    • genotype <- c("wt", "wt", "wt", "wt", "wt", "wt", "wt")
    • condition <- c("normal", "fibrosis", "normal", "fibrosis", "normal", "fibrosis", "fibrosis")
    • wt_metadata <- data.frame(genotype, condition)
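
To see why the negative binomial fits better than the Poisson here, a small simulation sketch (base R, made-up parameters, not course data): Poisson counts have variance roughly equal to the mean, while negative binomial counts are overdispersed.

```r
# Simulated counts with hypothetical parameters (not course data)
set.seed(42)
pois_counts <- rpois(10000, lambda = 100)
nb_counts   <- rnbinom(10000, mu = 100, size = 5)  # size = 1/dispersion

# Poisson: variance roughly equals the mean
mean(pois_counts); var(pois_counts)

# Negative binomial: variance = mu + mu^2/size, far exceeding the mean
mean(nb_counts); var(nb_counts)
```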

Example code includes:

# Load library for DESeq2
library(DESeq2)

# Load library for RColorBrewer
library(RColorBrewer)

# Load library for pheatmap
library(pheatmap)

# Load library for tidyverse
library(tidyverse)


# Explore the first six observations of smoc2_rawcounts
head(smoc2_rawcounts)

# Explore the structure of smoc2_rawcounts
str(smoc2_rawcounts)


# Create genotype vector
genotype <- rep("smoc2_oe", 7)

# Create condition vector
condition <- c(rep("fibrosis", 4), rep("normal", 3))

# Create data frame
smoc2_metadata <- data.frame(genotype, condition)

# Assign the row names of the data frame
rownames(smoc2_metadata) <- paste0("smoc2_", condition, c(1:4, 1, 3, 4))

Chapter 2 - Exploratory Data Analysis

Introduction to Differential Expression Analysis:

  • DESeq2 is a commonly used package for running differential expression analysis
    • vignette("DESeq2")
  • The DE analysis comes after quality control
    • Quality control - normalization and unsupervised clustering analysis
    • DE analysis - modelling raw counts, shrinking log2 fold changes, differential expression analysis
  • Need to bring in the raw counts
    • wt_rawcounts <- read.csv("fibrosis_wt_rawcounts.csv")
    • wt_metadata <- read.csv("fibrosis_wt_metadata_unordered.csv")

Organizing the data for DESeq2:

  • Need to have the metadata and the sample data in the same order and with the same rownames/colnames
    • all(rownames(wt_metadata) == colnames(wt_rawcounts))
    • match(vector1, vector2) # vector1 has the desired order, vector2 is the vector to reorder; the output is the indices for reorganizing vector2 to align with vector1
    • idx <- match(colnames(wt_rawcounts), rownames(wt_metadata))
    • reordered_wt_metadata <- wt_metadata[idx, ]
    • all(rownames(reordered_wt_metadata) == colnames(wt_rawcounts))
  • Can then create the DESeq2 object
    • dds_wt <- DESeqDataSetFromMatrix(countData = wt_rawcounts, colData = reordered_wt_metadata, design = ~ condition)

Count Normalization:

  • First step is to normalize raw counts to assess sample-level consistencies
    • Factors other than RNA expression can drive the expression of genes - library depth, gene length (longer gene means more fragments for sequencing), RNA composition
    • dds_wt <- estimateSizeFactors(dds_wt)
    • sizeFactors(dds_wt)
    • normalized_wt_counts <- counts(dds_wt, normalized=TRUE)

Hierarchical Heatmap:

  • Can compare the normalized counts across samples - starting with visualization and clustering methods such as hclust() or PCA
    • vsd_wt <- vst(dds_wt, blind=TRUE) # vst is a logarithmic transformation that moderates variance; blind=TRUE means that it should be blinded to the sampling metadata, and should be set for QC
  • Can run hierarchical clustering combined with heatmaps to assess clustering - generally, should see correlations > 0.8 unless there are outliers suggestive of quality errors (can confirm with PCA)
    • vsd_mat_wt <- assay(vsd_wt)
    • vsd_cor_wt <- cor(vsd_mat_wt)
    • library(pheatmap)
    • pheatmap(vsd_cor_wt, annotation = select(wt_metadata, condition))

Principal Component Analysis:

  • PCA is a technique to emphasize the variation within a dataset - first component represents the greatest variance
    • Genes can be given scores based on the degree to which they influence each principal component
  • PCA can be performed using native DESeq2 functions
    • plotPCA(vsd_wt, intgroup="condition") # intgroup is the metadata variable for coloring
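
For intuition, the same idea can be sketched with base R's prcomp() on a toy matrix (assumed random data, not course counts): the first component captures the largest share of the variance.

```r
# Toy data: 10 samples x 5 variables (hypothetical, for illustration only)
set.seed(1)
mat <- matrix(rnorm(50), nrow = 10)

pca <- prcomp(mat, scale. = TRUE)

# Proportion of variance explained by each principal component,
# ordered from largest to smallest
summary(pca)$importance["Proportion of Variance", ]
```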

Example code includes:

# Use the match() function to reorder the columns of the raw counts
match(rownames(smoc2_metadata), colnames(smoc2_rawcounts))

# Reorder the columns of the count data
reordered_smoc2_rawcounts <- smoc2_rawcounts[, match(rownames(smoc2_metadata), colnames(smoc2_rawcounts))]

# Create a DESeq2 object
dds_smoc2 <- DESeqDataSetFromMatrix(countData =  reordered_smoc2_rawcounts,
                              colData =  smoc2_metadata,
                              design = ~ condition)


# Determine the size factors to use for normalization
dds_smoc2 <- estimateSizeFactors(dds_smoc2)

# Extract the normalized counts
smoc2_normalized_counts <- counts(dds_smoc2, normalized=TRUE)


# Transform the normalized counts 
vsd_smoc2 <- vst(dds_smoc2, blind=TRUE)

# Extract the matrix of transformed counts
vsd_mat_smoc2 <- assay(vsd_smoc2)

# Compute the correlation values between samples
vsd_cor_smoc2 <- cor(vsd_mat_smoc2) 

# Plot the heatmap
pheatmap(vsd_cor_smoc2, annotation = select(smoc2_metadata, condition))


# Transform the normalized counts 
vsd_smoc2 <- vst(dds_smoc2, blind = TRUE)

# Plot the PCA of PC1 and PC2
plotPCA(vsd_smoc2, intgroup="condition")

Chapter 3 - Differential Expression Analysis with DESeq2

DE Analysis:

  • The DE analysis workflow includes 1) fitting raw counts to the negative binomial model, 2) shrinking the log2 fold changes, and 3) visualizing results
  • Begin by fitting raw counts to the negative binomial model
    • dds_wt <- DESeqDataSetFromMatrix(countData = wt_rawcounts, colData = reordered_wt_metadata, design = ~ condition) # can have ~ a + b + c in the design=, with the main component (e.g., treatment) last
    • ~ strain + sex + treatment + sex:treatment # interaction term for sex and treatment; include as the last term
    • dds_wt <- DESeq(dds_wt)

DESeq2 Model - Dispersion:

  • Can begin by exploring the model fit - are genes expressed differently across groups rather than within groups?
    • Log2(TreatmentMean / ControlMean)
    • mean_counts <- apply(wt_rawcounts[, 1:3], 1, mean)
    • variance_counts <- apply(wt_rawcounts[, 1:3], 1, var)
    • df <- data.frame(mean_counts, variance_counts)
    • ggplot(df) + geom_point(aes(x=mean_counts, y=variance_counts)) + scale_y_log10() + scale_x_log10() + xlab("Mean counts per gene") + ylab("Variance per gene")
  • Variance vs. mean is called “dispersion” in the DESeq2 modeling
    • Dispersion: Variance = Mean + Dispersion * Mean**2
    • Increase in variance leads to increase in dispersion
    • Increase in mean leads to decrease in dispersion
    • plotDispEsts(dds_wt)
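
The dispersion relation above can be checked directly in base R (made-up numbers, for illustration): solving Variance = Mean + Dispersion * Mean^2 for the dispersion.

```r
# Hypothetical per-gene mean and variance of counts (not course data)
mean_count <- 100
var_count  <- 2100

# Rearranging Variance = Mean + Dispersion * Mean^2
dispersion <- (var_count - mean_count) / mean_count^2
dispersion  # 0.2
```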

DESeq2 Model - Contrasts:

  • The negative binomial model is good for representing RNA Seq data
  • By default, the Wald test is run based on the condition
    • results(dds_wt, alpha = 0.05) # alpha is the desired significance level
  • Can supply own contrasts using the contrast argument
    • GENERAL SYNTAX: results(dds, contrast = c("condition_factor", "level_to_compare", "base_level"), alpha = 0.05)
    • wt_res <- results(dds_wt, contrast = c("condition", "fibrosis", "normal"), alpha = 0.05)
  • Can use the MA plot for further explorations
    • plotMA(wt_res, ylim=c(-8,8))
  • LFC shrinkage can be helpful for addressing high dispersions or low means
    • wt_res <- lfcShrink(dds_wt, contrast=c("condition", "fibrosis", "normal"), res=wt_res)
    • plotMA(wt_res, ylim=c(-8,8))

DESeq2 Results:

  • Can assess the results using the DESeq2 workflows
    • mcols(wt_res)
    • head(wt_res, n=10)
  • Multiple test corrections are run using the BH method
    • summary(wt_res)
    • wt_res <- results(dds_wt, contrast = c("condition", "fibrosis", "normal"), alpha = 0.05, lfcThreshold = 0.32)
    • wt_res <- lfcShrink(dds_wt, contrast=c("condition", "fibrosis", "normal"), res=wt_res)
    • summary(wt_res)
  • Can annotate the gene names for easier exploration and further analysis
    • library(annotables)
    • grcm38
    • wt_res_all <- data.frame(wt_res) %>% rownames_to_column(var = "ensgene") %>% left_join(y = grcm38[, c("ensgene", "symbol", "description")], by = "ensgene")
    • wt_res_sig <- subset(wt_res_all, padj < 0.05)
    • wt_res_sig <- wt_res_sig %>% arrange(padj)

Example code includes:

# Create DESeq2 object
dds_smoc2 <- DESeqDataSetFromMatrix(countData = reordered_smoc2_rawcounts, colData = smoc2_metadata, design = ~ condition)

# Run the DESeq2 analysis
dds_smoc2 <- DESeq(dds_smoc2)


# Plot dispersions
plotDispEsts(dds_smoc2)


# Extract the results of the differential expression analysis
smoc2_res <- results(dds_smoc2, 
                contrast = c("condition", "fibrosis", "normal"), 
                alpha = 0.05)


# Shrink the log2 fold change estimates to be more accurate
smoc2_res <- lfcShrink(dds_smoc2, 
                    contrast =  c("condition", "fibrosis", "normal"),
                    res = smoc2_res)


# Explore the results() function
?results

# Extract results
smoc2_res <- results(dds_smoc2, 
                contrast = c("condition", "fibrosis", "normal"), 
                alpha = 0.05, 
                lfcThreshold = 0.32)

# Shrink the log2 fold changes
smoc2_res <- lfcShrink(dds_smoc2, 
                       contrast = c("condition", "fibrosis", "normal"),
                       res = smoc2_res)


# Get an overview of the results                    
summary(smoc2_res)


# Save results as a data frame
smoc2_res_all <- data.frame(smoc2_res)

# Subset the results to only return the significant genes with p-adjusted values less than 0.05
smoc2_res_sig <- subset(smoc2_res_all, padj < 0.05)

Chapter 4 - Exploration of Differential Expression Results

Visualization of Results:

  • Expression heatmaps explore the expression of key genes
    • sig_norm_counts_wt <- normalized_counts_wt[wt_res_sig$ensgene, ]
    • library(RColorBrewer)
    • heat_colors <- brewer.pal(6, "YlOrRd")
    • display.brewer.all()
    • pheatmap(sig_norm_counts_wt, color = heat_colors, cluster_rows = T, show_rownames = F, annotation = select(wt_metadata, condition), scale = "row")
  • Can also plot using the volcano plot
    • wt_res_all <- wt_res_all %>% rownames_to_column(var = "ensgene") %>% mutate(threshold = padj < 0.05)
    • ggplot(wt_res_all) + geom_point(aes(x = log2FoldChange, y = -log10(padj), color = threshold)) + xlab("log2 fold change") + ylab("-log10 adjusted p-value") + theme(legend.position = "none", plot.title = element_text(size = rel(1.5), hjust = 0.5), axis.title = element_text(size = rel(1.25)))
  • The expression plot can also be helpful - e.g., top 20 genes plot
    • top_20 <- data.frame(sig_norm_counts_wt)[1:20, ] %>% rownames_to_column(var = "ensgene")
    • top_20 <- gather(top_20, key = "samplename", value = "normalized_counts", 2:8)
    • top_20 <- inner_join(top_20, rownames_to_column(wt_metadata, var = "samplename"), by = "samplename")
    • ggplot(top_20) + geom_point(aes(x = ensgene, y = normalized_counts, color = condition)) + scale_y_log10() + xlab("Genes") + ylab("Normalized Counts") + ggtitle("Top 20 Significant DE Genes") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(hjust = 0.5))

RNA-Seq DE Analysis Setup:

  • Can explore and filter for relevant and/or expected genes
    • Samples/library preparation
    • Sequence reads
    • Quality control
    • Splice-aware mapping to genome
    • Counting reads associated with genes
    • Statistical analysis to identify differentially expressed genes
  • The initial preparation uses the DESeq2 library
    • dds <- DESeqDataSetFromMatrix(countData = rawcounts, colData = metadata, design = ~ condition)

RNA-Seq DE Analysis Summary:

  • Need to normalize the data for DESeq analysis
    • dds <- estimateSizeFactors(dds)
    • normalized_counts <- counts(dds, normalized=TRUE)
  • Can cluster and plot for quality assurance and exploratory analysis
    • vsd <- vst(dds, blind=TRUE)
    • vsd %>% assay() %>% cor() %>% pheatmap(annotation = metadata[, c("column_name1", "column_name2")]) # extract the vst matrix, compute pairwise correlations, and plot the heatmap
    • plotPCA(vsd, intgroup="condition")
  • Can then run the DE analysis
    • dds <- DESeqDataSetFromMatrix(countData = rawcounts, colData = metadata, design = ~ source_of_variation + condition)
    • dds <- DESeq(dds)
  • Can then explore fits using dispersions - should decrease with increasing mean - and then run LFC shrinkage
    • plotDispEsts(dds)
    • res <- results(dds, contrast=c("condition_factor", "level_to_compare", "base_level"), alpha = 0.05)
    • res_all <- data.frame(res) %>% rownames_to_column(var = "ensgene")
    • res_all <- left_join(x=res_all, y=grcm38[, c("ensgene", "symbol", "description")], by = "ensgene")
    • res_all <- arrange(res_all, padj)
  • Identify significantly differentially expressed genes
    • res_sig <- subset(res_all, padj < 0.05)
    • plotMA(), volcano plot, expression heatmap

RNA-Seq Next Steps:

Example code includes:

# Create MA plot
plotMA(smoc2_res)

# Generate logical column 
smoc2_res_all <- data.frame(smoc2_res) %>% mutate(threshold = padj < 0.05)
              
# Create the volcano plot
ggplot(smoc2_res_all) + 
        geom_point(aes(x = log2FoldChange, y = -log10(padj), color = threshold)) + 
        xlab("log2 fold change") + 
        ylab("-log10 adjusted p-value") + 
        theme(legend.position = "none", 
              plot.title = element_text(size = rel(1.5), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25)))


# Subset normalized counts to significant genes
sig_norm_counts_smoc2 <- normalized_counts_smoc2[rownames(smoc2_res_sig), ]

# Choose heatmap color palette
heat_colors <- brewer.pal(n = 6, name = "YlOrRd")

# Plot heatmap
pheatmap(sig_norm_counts_smoc2, 
         color = heat_colors, 
         cluster_rows = TRUE, 
         show_rownames = FALSE,
         annotation = select(smoc2_metadata, condition), 
         scale = "row")


# Check that all of the samples in the count data are present in the metadata
all(colnames(all_rawcounts) %in% rownames(all_metadata))

# DESeq object to test for the effect of fibrosis regardless of genotype
dds_all <- DESeqDataSetFromMatrix(countData = all_rawcounts,
                        colData = all_metadata,
                        design = ~ genotype + condition)

# DESeq object to test for the effect of genotype on the effect of fibrosis                        
dds_complex <- DESeqDataSetFromMatrix(countData = all_rawcounts,
                        colData = all_metadata,
                        design = ~ genotype + condition + genotype:condition)


# Log transform counts for QC
vsd_all <- vst(dds_all, blind = TRUE)

# Create heatmap of sample correlation values
vsd_all %>% 
        assay() %>%
        cor() %>%
        pheatmap(annotation = select(all_metadata, c("condition", "genotype")))

# Create the PCA plot for PC1 and PC2 and color by condition       
plotPCA(vsd_all, intgroup="condition")

# Create the PCA plot for PC1 and PC2 and color by genotype       
plotPCA(vsd_all, intgroup="genotype")


# Select significant genes with padj < 0.05
smoc2_sig <- subset(res_all, padj < 0.05) %>%
    data.frame() %>%
    rownames_to_column(var = "geneID")

# Extract the top 6 genes with padj values
smoc2_sig %>%
    arrange(padj) %>%
    select(geneID, padj) %>%
    head()

Survival Analysis in R

Chapter 1 - What is Survival Analysis?

The term “survival analysis”:

  • Survival analysis methods are the same whether they are time to progression, time to finding a job, etc.
  • Can be called “time to event” analysis since the event need not be death; “survival analysis” is simply the most common name
  • Data sets for the course
    • data(GBSG2, package = "TH.data") # time to death in breast cancer
    • data(UnempDur, package = "Ecdat") # time to re-employment
    • help(UnempDur, package = "Ecdat")

Why learn survival methods?

  • Survival analysis requires something more than a linear model
    • Times are always positive - there cannot be negative times, nor a negative number of survivors
    • Different measures are of interest - hazard functions
    • Censoring is almost always an issue - often do not know when the event will happen, only that it did not happen by time X
  • Can use the R package survival
    • time <- c(5, 6, 2, 4, 4)
    • event <- c(1, 0, 0, 1, 1)
    • library("survival")
    • Surv(time, event)
  • R packages that are available for survival analysis
    • library("survival")
    • library("survminer")

Measures used in survival analysis:

  • Typical survival analysis questions include
    • What is the probability that a breast cancer patient survives longer than 5 years?
    • What is the typical waiting time for a cab?
    • Out of 100 unemployed people, how many do we expect to have a job again after 2 months?
  • The survival function is S(t) = P(T > t) = 1 - F(t) where F is the CDF
    • At any point in time, how probable is it to survive at least that long
    • The median is one of the common metrics used

Example code includes:

# Check out the help page for this dataset
# help(GBSG2, package = "TH.data")

# Load the data
data(GBSG2, package = "TH.data")

# Look at the summary of the dataset
summary(GBSG2)
##  horTh          age        menostat       tsize        tgrade   
##  no :440   Min.   :21.00   Pre :290   Min.   :  3.00   I  : 81  
##  yes:246   1st Qu.:46.00   Post:396   1st Qu.: 20.00   II :444  
##            Median :53.00              Median : 25.00   III:161  
##            Mean   :53.05              Mean   : 29.33            
##            3rd Qu.:61.00              3rd Qu.: 35.00            
##            Max.   :80.00              Max.   :120.00            
##      pnodes         progrec           estrec             time       
##  Min.   : 1.00   Min.   :   0.0   Min.   :   0.00   Min.   :   8.0  
##  1st Qu.: 1.00   1st Qu.:   7.0   1st Qu.:   8.00   1st Qu.: 567.8  
##  Median : 3.00   Median :  32.5   Median :  36.00   Median :1084.0  
##  Mean   : 5.01   Mean   : 110.0   Mean   :  96.25   Mean   :1124.5  
##  3rd Qu.: 7.00   3rd Qu.: 131.8   3rd Qu.: 114.00   3rd Qu.:1684.8  
##  Max.   :51.00   Max.   :2380.0   Max.   :1144.00   Max.   :2659.0  
##       cens       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4359  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
# Count censored and uncensored data
num_cens <- table(GBSG2$cens)
num_cens
## 
##   0   1 
## 387 299
# Create barplot of censored and uncensored data
barplot(table(GBSG2$cens))

# Create Surv-Object
sobj <- survival::Surv(GBSG2$time, GBSG2$cens)

# Look at 10 first elements
sobj[1:10]
##  [1] 1814  2018   712  1807   772   448  2172+ 2161+  471  2014+
# Look at summary
summary(sobj)
##       time            status      
##  Min.   :   8.0   Min.   :0.0000  
##  1st Qu.: 567.8   1st Qu.:0.0000  
##  Median :1084.0   Median :0.0000  
##  Mean   :1124.5   Mean   :0.4359  
##  3rd Qu.:1684.8   3rd Qu.:1.0000  
##  Max.   :2659.0   Max.   :1.0000
# Look at structure
str(sobj)
##  'Surv' num [1:686, 1:2] 1814  2018   712  1807   772   448  2172+ 2161+  471  2014+ ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "time" "status"
##  - attr(*, "type")= chr "right"

Chapter 2 - Estimation of Survival Curves

Kaplan-Meier Estimate:

  • The KM estimate is the step function derived from the raw data
    • Censoring puts a mark on the curve, but does not step down the curve
    • S(t) = S(t-1) * (1 - e(t) / n(t)) where e(t) is the number of events occurring at t and n(t) is the number at risk (neither censored nor having had the event before t)
  • Can run KM estimates using the survival function
    • km <- survfit(Surv(time, event) ~ 1)
    • ggsurvplot(km, conf.int = FALSE, risk.table = "nrisk_cumevents", legend = "none")
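
The KM recursion can be verified by hand in base R on the tiny toy data used later in the chapter (times and event flags, where 1 = event observed and 0 = censored):

```r
# Toy data from the course: 1 = event observed, 0 = censored
time  <- c(5, 6, 2, 4, 4)
event <- c(1, 0, 0, 1, 1)

# Kaplan-Meier by hand: S(t) = S(t-1) * (1 - e(t) / n(t))
times <- sort(unique(time))
s <- 1
surv <- sapply(times, function(tt) {
  n_risk  <- sum(time >= tt)               # still at risk just before tt
  n_event <- sum(time == tt & event == 1)  # events exactly at tt
  s <<- s * (1 - n_event / n_risk)
  s
})
data.frame(time = times, surv = surv)  # surv: 1.00, 0.50, 0.25, 0.25
```

These values match the survfit() output shown in the example code below.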

Understanding and Visualizing Kaplan-Meier Curves:

  • The ggsurvplot function can be useful for visualizations
    • library(survminer)
    • ggsurvplot(fit)
    • ggsurvplot(fit, palette = NULL, linetype = 1, surv.median.line = "none", risk.table = FALSE, cumevents = FALSE, cumcensor = FALSE, tables.height = 0.25, ...)
    • ggsurvplot(fit = km, palette = "blue", linetype = 1, surv.median.line = "hv", risk.table = TRUE, cumevents = TRUE, cumcensor = TRUE, tables.height = 0.1) # hv draws both horizontal and vertical median lines
  • Can also revisit the survfit() object, which produces the KM curve
    • If object is a formula: Kaplan-Meier estimation
    • Other options for object - coxph, survreg

Weibull Model for Estimating Survival Curves:

  • The Weibull model extends the KM curve through smoothing which can be more valuable for predictions
    • wb <- survreg(Surv(time, event) ~ 1, data) # survreg fits a parametric regression model (Weibull by default)
    • km <- survfit(Surv(time, event) ~ 1, data) # survfit gives the step-function KM estimate
    • predict(wb, type = "quantile", p = 1 - 0.9, newdata = data.frame(1)) # dummy newdata since there are no covariates
  • Can also create the survival curve
    • surv <- seq(.99, .01, by = -.01)
    • t <- predict(wb, type = "quantile", p = 1 - surv, newdata = data.frame(1))
    • head(data.frame(time = t, surv = surv))

Visualizing Results of Weibull Model:

  • Most of the visualization tools, such as ggsurvplot(), work on step functions rather than smooth curves, so model-based curves are plotted with ggsurvplot_df()
    • wb <- survreg(Surv(time, cens) ~ 1)
    • surv <- seq(.99, .01, by = -.01)
    • t <- predict(wb, type = "quantile", p = 1 - surv, newdata = data.frame(1))
    • surv_wb <- data.frame(time = t, surv = surv, upper = NA, lower = NA, std.err = NA)
    • ggsurvplot_df(fit = surv_wb, surv.geom = geom_line)

Example code includes:

# Create time and event data
time <- c(5, 6, 2, 4, 4)
event <- c(1, 0, 0, 1, 1)

# Compute Kaplan-Meier estimate
km <- survival::survfit(survival::Surv(time, event) ~ 1)
km
## Call: survfit(formula = survival::Surv(time, event) ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##     5.0     3.0     4.5     4.0      NA
# Take a look at the structure
str(km)
## List of 17
##  $ n         : int 5
##  $ time      : num [1:4] 2 4 5 6
##  $ n.risk    : num [1:4] 5 4 2 1
##  $ n.event   : num [1:4] 0 2 1 0
##  $ n.censor  : num [1:4] 1 0 0 1
##  $ surv      : num [1:4] 1 0.5 0.25 0.25
##  $ std.err   : num [1:4] 0 0.5 0.866 0.866
##  $ cumhaz    : num [1:4] 0 0.5 1 1
##  $ std.chaz  : num [1:4] 0 0.354 0.612 0.612
##  $ start.time: num 0
##  $ type      : chr "right"
##  $ logse     : logi TRUE
##  $ conf.int  : num 0.95
##  $ conf.type : chr "log"
##  $ lower     : num [1:4] 1 0.1877 0.0458 0.0458
##  $ upper     : num [1:4] 1 1 1 1
##  $ call      : language survfit(formula = survival::Surv(time, event) ~ 1)
##  - attr(*, "class")= chr "survfit"
# Create data.frame
data.frame(time = km$time, n.risk = km$n.risk, n.event = km$n.event, 
           n.censor = km$n.censor, surv = km$surv
           )
##   time n.risk n.event n.censor surv
## 1    2      5       0        1 1.00
## 2    4      4       2        0 0.50
## 3    5      2       1        0 0.25
## 4    6      1       0        1 0.25
# Create dancedat data
dancedat <- data.frame(name = c("Chris", "Martin", "Conny", "Desi", "Reni", "Phil", "Flo", "Andrea", "Isaac", "Dayra", "Caspar"),
                       time = c(20, 2, 14, 22, 3, 7, 4, 15, 25, 17, 12),
                       obs_end = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0)
                       )

# Estimate the survivor function pretending that all censored observations are actual observations.
km_wrong <- survival::survfit(survival::Surv(time) ~ 1, data = dancedat)

# Estimate the survivor function from this dataset via kaplan-meier.
km <- survival::survfit(survival::Surv(time, obs_end) ~ 1, data = dancedat)

# Plot the two and compare
survminer::ggsurvplot_combine(list(correct = km, wrong = km_wrong))

# Kaplan-Meier estimate
km <- survival::survfit(survival::Surv(time, cens) ~ 1, data = GBSG2)

# plot of the Kaplan-Meier estimate
survminer::ggsurvplot(km)

# add the risk table to plot
survminer::ggsurvplot(km, risk.table = TRUE)

# add a line showing the median survival time
survminer::ggsurvplot(km, risk.table = TRUE, surv.median.line = "hv")

# Weibull model
wb <- survival::survreg(survival::Surv(time, cens) ~ 1, data = GBSG2)

# Compute the median survival from the model
predict(wb, type = "quantile", p = 0.5, newdata = data.frame(1))
##       1 
## 1693.93
# 70 Percent of patients survive beyond time point...
predict(wb, type = "quantile", p = 1-0.7, newdata = data.frame(1))
##        1 
## 1004.524
# Retrieve survival curve from model probabilities 
surv <- seq(.99, .01, by = -.01)

# Get time for each probability
t <- predict(wb, type = "quantile", p = 1 - surv, newdata = data.frame(1))

# Create data frame with the information
surv_wb <- data.frame(time = t, surv = surv)

# Look at first few lines of the result
head(surv_wb)
##       time surv
## 1  60.6560 0.99
## 2 105.0392 0.98
## 3 145.0723 0.97
## 4 182.6430 0.96
## 5 218.5715 0.95
## 6 253.3125 0.94
# Create data frame with the information needed for ggsurvplot_df
surv_wb <- data.frame(time = t, surv = surv, upper = NA, lower = NA, std.err = NA)

# Plot
survminer::ggsurvplot_df(fit = surv_wb, surv.geom = geom_line)


Chapter 3 - Weibull Model

Why Use the Weibull Model?

  • Often the goal is to predict life expectancy from covariates such as treatment or tumor size
    • wbmod <- survreg(Surv(time, cens) ~ horTh + tsize, data = GBSG2)
    • coef(wbmod)

Visualizing Weibull Models:

  • Steps to produce the visualization include
    • Compute Weibull model
    • Decide on “imaginary patients”
    • Compute survival curves
    • Create data.frame with survival curve information
    • Plot
  • Example steps
    • wbmod <- survreg(Surv(time, cens) ~ horTh + tsize, data = GBSG2) # compute model
    • newdat <- expand.grid(horTh = levels(GBSG2$horTh), tsize = quantile(GBSG2$tsize, probs = c(0.25, 0.5, 0.75))) # create all combinations of therapy and tumor sizes
    • surv <- seq(.99, .01, by = -.01)
    • t <- predict(wbmod, type = "quantile", p = 1 - surv, newdata = newdat)
    • surv_wbmod_wide <- cbind(newdat, t)
    • surv_wbmod <- reshape2::melt(surv_wbmod_wide, id.vars = c("horTh", "tsize"), variable.name = "surv_id", value.name = "time")
    • surv_wbmod$surv <- surv[as.numeric(surv_wbmod$surv_id)]
    • surv_wbmod[, c("upper", "lower", "std.err", "strata")] <- NA
    • ggsurvplot_df(surv_wbmod, surv.geom = geom_line, linetype = "horTh", color = "tsize", legend.title = NULL)

Other Distributions:

  • The choice of distribution depends on assumptions about the underlying physical process; different distributions produce visibly different curves
    • survreg(Surv(time, cens) ~ horTh, data = GBSG2)
    • survreg(Surv(time, cens) ~ horTh, data = GBSG2, dist = "exponential")
    • survreg(Surv(time, cens) ~ horTh, data = GBSG2, dist = "lognormal")
  • The choice of distribution depends on both domain expertise and goodness of fit; Weibull models are generally more flexible than exponential models
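The difference between the distributions can be sketched directly from their survivor functions. A minimal base-R sketch (parameter values are illustrative, not fitted from GBSG2): the exponential model forces a constant hazard, while the Weibull adds a shape parameter that allows rising or falling hazards.

```r
# Exponential survivor function: constant hazard `rate`
surv_exp     <- function(t, rate)         exp(-rate * t)
# Weibull survivor function: shape > 1 rising hazard, shape < 1 falling hazard
surv_weibull <- function(t, shape, scale) exp(-(t / scale)^shape)

t <- c(0, 500, 1000, 2000)
surv_exp(t, rate = 1 / 2000)               # constant-hazard curve
surv_weibull(t, shape = 1, scale = 2000)   # shape = 1 reduces to the exponential
surv_weibull(t, shape = 1.5, scale = 2000) # increasing hazard, different curve
```

With shape fixed at 1 the two functions coincide, which is why the exponential model is the special case and the Weibull the more flexible choice.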

Example code includes:

dfTime <- c(306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, 654, 728, 71, 567, 144, 613, 707, 61, 88, 301, 81, 624, 371, 394, 520, 574, 118, 390, 12, 473, 26, 533, 107, 53, 122, 814, 965, 93, 731, 460, 153, 433, 145, 583, 95, 303, 519, 643, 765, 735, 189, 53, 246, 689, 65, 5, 132, 687, 345, 444, 223, 175, 60, 163, 65, 208, 821, 428, 230, 840, 305, 11, 132, 226, 426, 705, 363, 11, 176, 791, 95, 196, 167, 806, 284, 641, 147, 740, 163, 655, 239, 88, 245, 588, 30, 179, 310, 477, 166, 559, 450, 364, 107, 177, 156, 529, 11, 429, 351, 15, 181, 283, 201, 524, 13, 212, 524, 288, 363, 442, 199, 550, 54, 558, 207, 92, 60, 551, 543, 293, 202, 353, 511, 267, 511, 371, 387, 457, 337, 201, 404, 222, 62, 458, 356, 353, 163, 31, 340, 229, 444, 315, 182, 156, 329, 364, 291, 179, 376, 384, 268, 292, 142, 413, 266, 194, 320, 181, 285, 301, 348, 197, 382, 303, 296, 180, 186, 145, 269, 300, 284, 350, 272, 292, 332, 285, 259, 110, 286, 270, 81, 131, 225, 269, 225, 243, 279, 276, 135, 79, 59, 240, 202, 235, 105, 224, 239, 237, 173, 252, 221, 185, 92, 13, 222, 192, 183, 211, 175, 197, 203, 116, 188, 191, 105, 174, 177)
dfStatus <- c(2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1)
dfSex <- factor(ifelse(c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2)==1, "male", "female"), levels=c("male", "female"))
dat <- data.frame(time=dfTime, status=dfStatus, sex=dfSex)


# Look at the data set
str(dat)
## 'data.frame':    228 obs. of  3 variables:
##  $ time  : num  306 455 1010 210 883 ...
##  $ status: num  2 2 1 2 2 1 2 2 2 2 ...
##  $ sex   : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 2 2 1 1 ...
# Estimate a Weibull model
wbmod <- survival::survreg(survival::Surv(time, status) ~ sex, data = dat)
coef(wbmod)
## (Intercept)   sexfemale 
##    5.884162    0.395578
# Weibull model
wbmod <- survival::survreg(survival::Surv(time, cens) ~ horTh, data = GBSG2)

# Retrieve survival curve from model
surv <- seq(.99, .01, by = -.01)
t_yes <- predict(wbmod, type = "quantile", p = 1-surv, newdata = data.frame(horTh = "yes"))

# Take a look at survival curve
str(t_yes)
##  num [1:99] 76.4 131.4 180.9 227.2 271.4 ...
# Weibull model
wbmod <- survival::survreg(survival::Surv(time, cens) ~ horTh + tsize, data = GBSG2)

# Imaginary patients
newdat <- expand.grid(horTh = levels(GBSG2$horTh), 
                      tsize = quantile(GBSG2$tsize, probs = c(0.25, 0.5, 0.75))
                      )

# Compute survival curves
surv <- seq(.99, .01, by = -.01)
t <- predict(wbmod, type = "quantile", p = 1-surv, newdata = newdat)

# How many rows and columns does t have?
dim(t)
## [1]  6 99
# Use cbind() to combine the information in newdat with t
surv_wbmod_wide <- cbind(newdat, t)
  
# Use melt() to bring the data.frame to long format
surv_wbmod <- reshape2::melt(surv_wbmod_wide, id.vars = c("horTh", "tsize"), 
                             variable.name = "surv_id", value.name = "time"
                             )

# Use surv_wbmod$surv_id to add the correct survival probabilities surv
surv_wbmod$surv <- surv[as.numeric(surv_wbmod$surv_id)]

# Add columns upper, lower, std.err, and strata to the data.frame
surv_wbmod[, c("upper", "lower", "std.err", "strata")] <- NA

# Take a look at the structure of the object
str(surv_wbmod)
## 'data.frame':    594 obs. of  9 variables:
##  $ horTh  : Factor w/ 2 levels "no","yes": 1 2 1 2 1 2 1 2 1 2 ...
##  $ tsize  : num  20 20 25 25 35 35 20 20 25 25 ...
##  $ surv_id: Factor w/ 99 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ time   : num  65.9 90 62 84.6 54.9 ...
##  $ surv   : num  0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.98 0.98 ...
##  $ upper  : logi  NA NA NA NA NA NA ...
##  $ lower  : logi  NA NA NA NA NA NA ...
##  $ std.err: logi  NA NA NA NA NA NA ...
##  $ strata : logi  NA NA NA NA NA NA ...
# Plot the survival curves
survminer::ggsurvplot_df(surv_wbmod, surv.geom = geom_line,
                         linetype = "horTh", color = "tsize", legend.title = NULL
                         )

# Weibull model
wbmod <- survival::survreg(survival::Surv(time, cens) ~ horTh, data = GBSG2)

# Log-Normal model
lnmod <- survival::survreg(survival::Surv(time, cens) ~ horTh, data = GBSG2, dist = "lognormal")

# Newdata
newdat <- data.frame(horTh = levels(GBSG2$horTh))

# Surv
surv <- seq(0.99, .01, by = -.01)

# Survival curve from Weibull model and log-normal model
wbt <- predict(wbmod, type = "quantile", p = 1-surv, newdata = newdat)
lnt <- predict(lnmod, type = "quantile", p = 1-surv, newdata = newdat)


dfWbtLnt <- as.data.frame(rbind(wbt, lnt))
names(dfWbtLnt) <- as.character(1:99)

surv_wide <- cbind(data.frame(horTh=factor(c("no", "yes", "no", "yes"), levels=c("no", "yes"))), 
                   dfWbtLnt, 
                   data.frame(dist=factor(c("weibull", "weibull", "lognormal", "lognormal")))
                   )

# Melt the data.frame into long format.
surv_long <- reshape2::melt(surv_wide, id.vars = c("horTh", "dist"), 
                            variable.name = "surv_id", 
                            value.name = "time"
                            )

# Add column for the survival probabilities
surv_long$surv <- surv[as.numeric(surv_long$surv_id)]

# Add columns upper, lower, std.err, and strata containing NA values
surv_long[, c("upper", "lower", "std.err", "strata")] <- NA


# Plot the survival curves
survminer::ggsurvplot_df(surv_long, surv.geom = geom_line, 
                         linetype = "horTh", color = "dist", legend.title = NULL
                         )


Chapter 4 - Cox Model

Cox Model - most widely used model in survival analysis:

  • Semi-parametric (vs. Weibull model which is fully parametric)
  • Also called the “proportional hazards” model; the hazard (instantaneous risk) is assumed to be proportional across groups, so survival curves cannot cross
  • Computing the Cox model is very similar to the Weibull model; there is no intercept, and a negative coefficient means a positive impact on survival
    • cxmod <- coxph(Surv(time, cens) ~ horTh, data = GBSG2)
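Because the Cox model works on the log-hazard scale, exponentiating a coefficient gives a hazard ratio. A minimal sketch (the coefficient value is hypothetical, not taken from the GBSG2 fit):

```r
coef_horTh   <- -0.36              # hypothetical log hazard ratio for horTh = "yes"
hazard_ratio <- exp(coef_horTh)    # ratio of instantaneous risks, therapy vs. none

hazard_ratio                       # below 1: the negative coefficient means better survival
(1 - hazard_ratio) * 100           # approximate percent reduction in hazard
```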

Visualizing the Cox Model:

  • Five steps for visualizing the Cox model
    • Compute Cox model
    • Decide on covariate combinations (“imaginary patients”)
    • Compute survival curves
    • Create data.frame with survival curve information
    • Plot
  • Example code includes
    • cxmod <- coxph(Surv(time, cens) ~ horTh + tsize, data = GBSG2)
    • newdat <- expand.grid(horTh = levels(GBSG2$horTh), tsize = quantile(GBSG2$tsize, probs = c(0.25, 0.5, 0.75)))
    • rownames(newdat) <- letters[1:6]
    • cxsf <- survfit(cxmod, data = GBSG2, newdata = newdat, conf.type = "none")
    • surv_cxmod0 <- surv_summary(cxsf)
    • surv_cxmod <- cbind(surv_cxmod0, newdat[as.character(surv_cxmod0$strata), ])
    • ggsurvplot_df(surv_cxmod, linetype = "horTh", color = "tsize", legend.title = NULL, censor = FALSE)

Recap:

  • Concepts - survival analysis, censoring, survival curves
  • Methods - Kaplan-Meier, Weibull, Cox
  • Focus - understand, estimate, and visualize survival curves

Wrap Up:

  • The ability to interpret model estimates is a useful follow-up skill
    • Statistical inference, confidence intervals, etc.
  • Competing risk models are a valuable follow-on topic
  • Many additional modeling libraries available on CRAN

Example code includes:

dat$performance <- c(90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, 80, 80, 90, 50, 60, 90, 80, 100, 70, 90, 90, 90, 100, 60, 80, 70, 90, 60, 60, 50, 70, 50, 70, 70, 50, 80, 80, 60, 90, 70, 60, 60, 90, 80, 90, 90, 90, 80, 90, 100, 90, 90, 100, 70, 80, 90, 70, 90, 80, 90, 80, 70, 70, 90, 100, 80, 90, 80, 70, 80, 90, 90, 100, 80, 90, 90, 100, 70, 80, 80, 80, 80, 80, 100, 90, 70, 100, 80, 90, 80, 100, 80, 80, 90, 90, 90, 100, 80, 70, 90, 50, 80, 80, 90, 100, 60, 90, 80, 80, 90, 80, 70, 70, 60, 70, 80, 90, 70, 70, 60, 90, 80, 80, 80, 80, 90, 80, 80, 100, 80, 90, 60, 80, 80, 90, 100, 70, 80, 70, 80, 80, 90, 100, 90, 100, 100, 70, 90, 90, 80, 70, 70, 90, 70, 80, 80, 90, 90, 60, 90, 80, 90, 80, 100, 90, 100, 90, 90, 90, 100, 90, 80, 60, 80, 80, 100, 100, 100, 90, 80, 90, 90, 70, 90, 80, 90, 80, 60, 90, 90, 90, 100, 80, 90, 100, 90, 90, 60, 90, 100, 100, NA, 80, 60, 80, 90, 100, 80, 90, 70, 80, 90, 90, 80, 70, 80, 80, 80, 80, 80, 90, 60, 90, 80)
str(dat)
## 'data.frame':    228 obs. of  4 variables:
##  $ time       : num  306 455 1010 210 883 ...
##  $ status     : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ sex        : Factor w/ 2 levels "male","female": 1 1 1 1 1 1 2 2 1 1 ...
##  $ performance: num  90 90 90 90 100 50 70 60 70 70 ...
# Compute Cox model
cxmod <- survival::coxph(survival::Surv(time, status) ~ performance, data = dat)

# Show model coefficient
coef(cxmod)
## performance 
## -0.01644821
# Cox model
cxmod <- survival::coxph(survival::Surv(time, cens) ~ horTh + tsize, data = GBSG2)

# Imaginary patients
newdat <- expand.grid(horTh = levels(GBSG2$horTh), 
                      tsize = quantile(GBSG2$tsize, probs = c(0.25, 0.5, 0.75))
                      )
rownames(newdat) <- letters[1:6]

# Compute survival curves
cxsf <- survival::survfit(cxmod, data = GBSG2, newdata = newdat, conf.type = "none")

# Look at first 6 rows of cxsf$surv and time points
head(cxsf$surv)
##      a b c d e f
## [1,] 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1
## [4,] 1 1 1 1 1 1
## [5,] 1 1 1 1 1 1
## [6,] 1 1 1 1 1 1
head(cxsf$time)
## [1]  8 15 16 17 18 29
# Remove conf.type="none" per https://github.com/kassambara/survminer/issues/355
cxsf <- survival::survfit(cxmod, data = GBSG2, newdata = newdat)

# Compute data.frame needed for plotting
surv_cxmod0 <- survminer::surv_summary(cxsf)

# Get a character vector of patient letters (patient IDs)
pid <- as.character(surv_cxmod0$strata)

# Duplicate the rows in newdat so that they line up with surv_cxmod0
m_newdat <- newdat[pid, ]

# Add patient info to data.frame
surv_cxmod <- cbind(surv_cxmod0, m_newdat)

# Plot
survminer::ggsurvplot_df(surv_cxmod, linetype = "horTh", color = "tsize", 
                         legend.title = NULL, censor = FALSE
                         )

# Compute Cox model and survival curves
cxmod <- survival::coxph(survival::Surv(time, status) ~ performance, data = dat)
new_lung <- data.frame(performance = c(60, 70, 80, 90))
cxsf <- survival::survfit(cxmod, data = dat, newdata = new_lung)

# Use the summary of cxsf to take a vector of patient IDs
surv_cxmod0 <- survminer::surv_summary(cxsf)
pid <- as.character(surv_cxmod0$strata)

# Duplicate rows in newdat to fit with surv_cxmod0 and add them in
m_newdat <- new_lung[pid, , drop = FALSE]
surv_cxmod <- cbind(surv_cxmod0, m_newdat)

# Plot
survminer::ggsurvplot_df(surv_cxmod, color = "performance", legend.title = NULL, censor = FALSE)

# Compute Kaplan-Meier curve
km <- survival::survfit(survival::Surv(time, status) ~ 1, data = dat)

# Compute Cox model
cxmod <- survival::coxph(survival::Surv(time, status) ~ performance, data = dat)

# Compute Cox model survival curves
new_lung <- data.frame(performance = c(60, 70, 80, 90))
cxsf <- survival::survfit(cxmod, data = dat, newdata = new_lung)

# Plot Kaplan-Meier curve
survminer::ggsurvplot(km, conf.int = FALSE)

# Plot Cox model survival curves
survminer::ggsurvplot(cxsf, censor = FALSE)


Building Response Models in R

Chapter 1 - Response Models for Product Sales

Fundamentals of Market Response Models:

  • Market response models are statistical tools for optimizing advertising and promotion (A&P) spend across the marketing mix
  • The dataset for the course covers retail sales
    • str(sales.data)

Linear Response Models:

  • The model will have sales as the response variable and the other variables as predictors
  • Example of the simple linear model for Sales vs. Price
    • linear.model <- lm(SALES ~ PRICE , data = sales.data)
    • coef(linear.model)
    • coef(linear.model)[1] + 0.95 * coef(linear.model)[2]
    • plot(SALES ~ PRICE, data = sales.data)
    • abline(coef(linear.model))

Non-Linear Response Models:

  • Non-linear response is a more common real-world marketing mix behavior
  • The exponential response function implies a constant semi-elasticity: each unit change in price multiplies sales by the same factor
    • Sales = B0 * exp(B1 * Price)
    • log(Sales) = log(B0) + (B1*Price)
    • log.model <- lm(log(SALES) ~ PRICE, data = sales.data)
    • coef(log.model)
    • plot(log(SALES) ~ PRICE, data = sales.data)
    • log.model <- lm(log(SALES) ~ PRICE, data = sales.data)
    • abline(coef(log.model))
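Back-transforming the fitted semi-log model shows why the response is multiplicative in price. A short sketch using the coefficients this model reports (intercept ≈ 5.084, slope ≈ −0.662):

```r
b0 <- 5.0843983   # intercept of log.model, as reported in the example code
b1 <- -0.6622516  # PRICE coefficient of log.model

# Invert the log transform to predict on the original SALES scale
predict_sales <- function(price) exp(b0 + b1 * price)

predict_sales(1.00)                        # expected SALES at PRICE = 1.00
predict_sales(1.10) / predict_sales(1.00)  # any +0.10 price step scales sales by exp(b1 * 0.10)
```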

Example code includes:

load("./RInputFiles/sales.data.RData")
load("./RInputFiles/choice.data.RData")
str(sales.data)
## 'data.frame':    124 obs. of  6 variables:
##  $ OBS          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SALES        : num  61.8 11.5 61.6 38.3 31.7 ...
##  $ PRICE        : num  1.09 1.27 1.27 1.27 1.27 ...
##  $ DISPLAY      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUPON       : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ DISPLAYCOUPON: int  0 0 0 0 0 0 0 0 0 0 ...
str(choice.data)
## 'data.frame':    2798 obs. of  13 variables:
##  $ OBS          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ HOUSEHOLDID  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LASTPURCHASE : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BUD          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ HOPPINESS    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PRICE.BUD    : num  0.052 0.052 0.046 0.052 0.046 ...
##  $ PRICE.HOP    : num  0.034 0.044 0.048 0.034 0.048 ...
##  $ DISPL.BUD    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DISPL.HOP    : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ FEAT.BUD     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ FEAT.HOP     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FEATDISPL.BUD: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FEATDISPL.HOP: int  0 0 0 0 0 0 1 0 0 0 ...
# Tail of sales.data
tail(sales.data)
##     OBS    SALES    PRICE DISPLAY COUPON DISPLAYCOUPON
## 119 119 11.82116 0.987500       0      0             0
## 120 120 14.36825 0.987500       0      0             0
## 121 121 34.71571 0.987500       0      0             0
## 122 122 52.61871 0.987500       0      0             0
## 123 123 52.41674 1.026786       0      0             0
## 124 124 49.35723 1.050000       0      0             0
# Mean SALES
mean(sales.data$SALES)
## [1] 118.8262
# Minimum SALES
min(sales.data$SALES)
## [1] 11.4692
# Maximum SALES
max(sales.data$SALES)
## [1] 1406.698
# Linear model explaining SALES by PRICE
linear.model <- lm(SALES ~ PRICE, data = sales.data)

# Obtain the model coefficients
coef(linear.model)
## (Intercept)       PRICE 
##    274.2486   -134.3097
# Obtain the intercept coefficient
coef(linear.model)[1]
## (Intercept) 
##    274.2486
# Obtain the slope coefficient
coef(linear.model)[2]
##     PRICE 
## -134.3097
# Predict the SALES for the decreased PRICE of 1.05 
coef(linear.model)[1] + 1.05 * coef(linear.model)[2]
## (Intercept) 
##    133.2234
# Predict the SALES for the decreased PRICE of 0.95 
coef(linear.model)[1] + 0.95 * coef(linear.model)[2]
## (Intercept) 
##    146.6544
# Linear model explaining SALES by PRICE
linear.model <- lm(SALES ~ PRICE, data = sales.data)

# Plot SALES against PRICE
plot(SALES ~ PRICE, data = sales.data)

# Adding the model predictions
abline(coef(linear.model))

# Linear model explaining log(SALES) by PRICE
log.model <- lm(log(SALES) ~ PRICE, data=sales.data)

# Obtaining the model coefficients
coef(log.model)
## (Intercept)       PRICE 
##   5.0843983  -0.6622516
# Plot log(SALES) against PRICE
plot(log(SALES) ~ PRICE, data=sales.data)

# Linear model explaining log(SALES) by PRICE
log.model <- lm(log(SALES) ~ PRICE, data=sales.data)

# Adding the model predictions
abline(coef(log.model))


Chapter 2 - Extended Sales Response Modeling

Model Extension Part 1: Dummy Variables:

  • Dummy variables typically take on the value of 0 or 1
  • Can start by summarizing dummy variables in the sales.data file
    • table(sales.data$DISPLAY)
    • table(sales.data$DISPLAY) / sum(table(sales.data$DISPLAY))
    • mean(sales.data$DISPLAY)
  • Can also include dummy variables in the regression
    • dummy.model <- lm(log(SALES) ~ DISPLAY, data = sales.data)
    • coef(dummy.model)
    • exp(coef(dummy.model)[1]) # Average unit sales for no display
    • exp(coef(dummy.model)[2]) - 1 # Percentage increase in sales for a display
  • Can add dummy variables for areas like discounts
    • summary(sales.data[, c("DISPLAY", "COUPON", "DISPLAYCOUPON")])
    • dummy.model <- lm(log(SALES) ~ DISPLAY + COUPON + DISPLAYCOUPON, data = sales.data)
    • coef(dummy.model)
    • update(dummy.model, . ~ . + PRICE) # adds PRICE to the model and refits
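The dummy-coefficient back-transformation can be sketched with the coefficients this DISPLAY model reports (intercept ≈ 4.195, DISPLAY ≈ 0.463):

```r
b0 <- 4.1949532    # intercept of log(SALES) ~ DISPLAY, as reported in the example code
b1 <- 0.4625243    # DISPLAY coefficient

exp(b0)              # average unit sales in weeks without a display
(exp(b1) - 1) * 100  # percent sales lift in weeks with a display (~59%)
```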

Model Extensions Part 2: Dynamic Variables:

  • The carryover effect is the delay between a marketing activity and its effect on sales
    • Typically evaluated using lags
    • head(cbind(sales.data$PRICE, lag(sales.data$PRICE, n = 1)))
    • Price.lag <- lag(sales.data$PRICE)
    • lag.model <- lm(log(SALES) ~ PRICE + Price.lag, data = sales.data)
    • Coupon.lag <- lag(sales.data$COUPON)
    • update(lag.model, . ~ . + COUPON + Coupon.lag)
    • lag.model <- lm(log(SALES) ~ PRICE + Price.lag + DISPLAY + Display.lag + COUPON + Coupon.lag + DISPLAYCOUPON + DisplayCoupon.lag, data = sales.data)
    • plot(log(SALES) ~ OBS, data = sales.data)
    • lines(c(NA, fitted.values(lag.model)) ~ OBS, data = sales.data)
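Note that with the tidyverse attached, lag() here is dplyr::lag() (it masks stats::lag(), per the attach message), which shifts a vector one position and pads the front with NA. A base-R sketch of the same behavior:

```r
# Base-R equivalent of dplyr::lag(): shift right by n, pad the front with NA
lag_base <- function(x, n = 1) c(rep(NA, n), head(x, -n))

price <- c(1.09, 1.27, 1.27, 1.31)  # illustrative values
lag_base(price)                     # NA 1.09 1.27 1.27
# The leading NA is why the lagged regressions drop one observation.
```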

Number of Extensions Needed:

  • Can summarize the model to view top-level findings, including R-squared and p-values for each variable
    • summary(extended.model)
    • AIC(extended.model) # information criteria
    • AIC(lm(update(extended.model, . ~ . - Coupon.lag), data = sales.data))
  • Can also run an elimination process using MASS
    • library(MASS)
    • final.model <- stepAIC(extended.model, direction = "backward", trace = FALSE)
    • summary(final.model)

Example code includes:

# Proportion of DISPLAY and no-DISPLAY activity
table(sales.data$DISPLAY) / sum(table(sales.data$DISPLAY))
## 
##        0        1 
## 0.733871 0.266129
# Mean of DISPLAY activity
mean(sales.data$DISPLAY)
## [1] 0.266129
# Mean of no-DISPLAY activity
1 - mean(sales.data$DISPLAY)
## [1] 0.733871
# Linear model explaining log(SALES) by DISPLAY
dummy.model <- lm(log(SALES) ~ DISPLAY, data = sales.data)

# Obtaining the coefficients
coef(dummy.model)
## (Intercept)     DISPLAY 
##   4.1949532   0.4625243
# Mean DISPLAY activity
mean(sales.data$DISPLAY)
## [1] 0.266129
# Mean COUPON activity
mean(sales.data$COUPON)
## [1] 0.09677419
# Mean DISPLAY and COUPON activity
mean(sales.data$DISPLAYCOUPON)
## [1] 0.05645161
# Summarize DISPLAY, COUPON, DISPLAYCOUPON activity
summary(sales.data[,c("DISPLAY", "COUPON", "DISPLAYCOUPON")])
##     DISPLAY           COUPON        DISPLAYCOUPON    
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.2661   Mean   :0.09677   Mean   :0.05645  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
# Linear model explaining log(SALES) by DISPLAY, COUPON and DISPLAYCOUPON
dummy.model <- lm(log(SALES) ~ DISPLAY + COUPON + DISPLAYCOUPON, data = sales.data)

# Obtain the model coefficients
coef(dummy.model)
##   (Intercept)       DISPLAY        COUPON DISPLAYCOUPON 
##     3.7975707     0.8599068     1.7597567     2.1492468
# dummy.model updated for PRICE
update(dummy.model, . ~ . + PRICE)
## 
## Call:
## lm(formula = log(SALES) ~ DISPLAY + COUPON + DISPLAYCOUPON + 
##     PRICE, data = sales.data)
## 
## Coefficients:
##   (Intercept)        DISPLAY         COUPON  DISPLAYCOUPON          PRICE  
##        3.4310         0.8747         1.7646         2.1630         0.3123
# Compare lagged PRICE and original PRICE
head(cbind(sales.data$PRICE, lag(sales.data$PRICE)))
##          [,1]     [,2]
## [1,] 1.090000       NA
## [2,] 1.271818 1.090000
## [3,] 1.271818 1.271818
## [4,] 1.271818 1.271818
## [5,] 1.271818 1.271818
## [6,] 1.271818 1.271818
# Create the lagged PRICE variable
Price.lag <- lag(sales.data$PRICE)

# Linear model explaining log(SALES) by PRICE and Price.lag
lag.model <- lm(log(SALES) ~ PRICE + Price.lag, data = sales.data)

# Obtain the coefficients
coef(lag.model)
## (Intercept)       PRICE   Price.lag 
##    3.905902   -4.578985    4.934948
# Create the lagged COUPON variable
Coupon.lag <-  lag(sales.data$COUPON)

# Update lag.model for COUPON and Coupon.lag
update(lag.model, . ~ . + COUPON + Coupon.lag)
## 
## Call:
## lm(formula = log(SALES) ~ PRICE + Price.lag + COUPON + Coupon.lag, 
##     data = sales.data)
## 
## Coefficients:
## (Intercept)        PRICE    Price.lag       COUPON   Coupon.lag  
##       3.833       -4.505        4.843        1.354       -0.384
sales.data2 <- sales.data %>%
    mutate(Price.lag = lag(PRICE, 1), 
           Display.lag = lag(DISPLAY, 1),
           Coupon.lag = lag(COUPON, 1),
           DisplayCoupon.lag = lag(DISPLAYCOUPON, 1)
           )

# Extended sales response model
extended.model <- lm(log(SALES) ~ PRICE + Price.lag + DISPLAY + Display.lag + COUPON + 
                         Coupon.lag + DISPLAYCOUPON + DisplayCoupon.lag, data = sales.data2
                     )

# Plot log(SALES) against OBS
plot(log(SALES) ~ OBS, data = sales.data2)

# Add the model predictions
lines(c(NA, fitted.values(extended.model)) ~ OBS, data = sales.data2)

# Summarize the model
summary(extended.model)
## 
## Call:
## lm(formula = log(SALES) ~ PRICE + Price.lag + DISPLAY + Display.lag + 
##     COUPON + Coupon.lag + DISPLAYCOUPON + DisplayCoupon.lag, 
##     data = sales.data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27541 -0.31110  0.01536  0.34871  0.92009 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.2561     0.5654   3.991 0.000117 ***
## PRICE              -2.6857     0.7921  -3.390 0.000959 ***
## Price.lag           3.9920     0.7959   5.016 1.96e-06 ***
## DISPLAY             0.4570     0.1279   3.572 0.000521 ***
## Display.lag         0.5097     0.1180   4.319 3.36e-05 ***
## COUPON              1.7531     0.1576  11.121  < 2e-16 ***
## Coupon.lag         -0.2098     0.1567  -1.339 0.183344    
## DISPLAYCOUPON       2.0087     0.2017   9.960  < 2e-16 ***
## DisplayCoupon.lag   0.4489     0.2112   2.126 0.035695 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5 on 114 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.6934 
## F-statistic:  35.5 on 8 and 114 DF,  p-value: < 2.2e-16
# AIC of the extended response model
AIC(extended.model)
## [1] 189.21
# Single term deletion
AIC(lm(update(extended.model, . ~ . -Coupon.lag), data = sales.data2))
## [1] 189.1284
# Backward elimination
final.model <- MASS::stepAIC(extended.model, direction = "backward", trace = FALSE)

# Summarize the final model
summary(final.model)
## 
## Call:
## lm(formula = log(SALES) ~ PRICE + Price.lag + DISPLAY + Display.lag + 
##     COUPON + DISPLAYCOUPON + DisplayCoupon.lag, data = sales.data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24037 -0.31964  0.01535  0.35218  0.94686 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.1887     0.5651   3.873 0.000179 ***
## PRICE              -2.6888     0.7949  -3.383 0.000982 ***
## Price.lag           4.0267     0.7982   5.045 1.71e-06 ***
## DISPLAY             0.4524     0.1283   3.525 0.000609 ***
## Display.lag         0.5447     0.1155   4.717 6.78e-06 ***
## COUPON              1.7635     0.1580  11.161  < 2e-16 ***
## DISPLAYCOUPON       1.9954     0.2021   9.872  < 2e-16 ***
## DisplayCoupon.lag   0.4839     0.2103   2.301 0.023182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5017 on 115 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.6913 
## F-statistic: 40.04 on 7 and 115 DF,  p-value: < 2.2e-16

Chapter 3 - Response Models for Individual Demand

Models for Individual Demand:

  • Data are available about customer purchases
    • str(choice.data)
    • OBS - observation week
    • HOUSEHOLDID - household ID for the purchase record
    • LASTPURCHASE - last purchase recorded for the household
  • The probability to purchase cannot be well modeled with a purely linear model
    • probability.model <- lm(HOPPINESS ~ PRICE.HOP, data = choice.data)
    • plot(HOPPINESS ~ PRICE.HOP, data = choice.data)
    • abline(probability.model)
  • Can define a new variable, price ratio
    • price.ratio <- log(choice.data$PRICE.HOP / choice.data$PRICE.BUD)
    • probability.model <- lm(HOPPINESS ~ price.ratio, data = choice.data)
    • plot(HOPPINESS ~ price.ratio, data = choice.data)
    • abline(probability.model)

Logistic Response Models:

  • The logistic response function is a better predictor for choice data since it is bounded between 0 and 1
    • logistic.model <- glm(HOPPINESS ~ price.ratio, family = binomial, data = choice.data)
    • coef(logistic.model)
    • plot(HOPPINESS ~ price.ratio, data = choice.data)
    • curve(predict(logistic.model, data.frame(price.ratio = x), type = "response"), add = TRUE)
    • margins(logistic.model)
  • Can also look at an effects plot using the margins package
    • x <- seq(-1.25, 1.25, by = 0.25)
    • cplot(logistic.model, "price.ratio", xvals = x)
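The logistic response function itself is just plogis() applied to the linear predictor. A sketch using the coefficients this model reports (intercept ≈ −3.57, slope ≈ −6.74):

```r
b0 <- -3.572678  # intercept of logistic.model, as reported in the example code
b1 <- -6.738768  # price.ratio coefficient

# Logistic response: bounded in (0, 1), unlike the linear probability model
p_purchase <- function(price.ratio) plogis(b0 + b1 * price.ratio)

p_purchase(c(-0.5, 0, 0.5))  # probability falls as HOP becomes relatively pricier
```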

Probit Response Models:

  • Can use the probit response function in lieu of the logit response function, with slight changes to the tails
    • probit.model <- glm(HOPPINESS ~ price.ratio, family = binomial(link = probit), data = choice.data)
    • coef(probit.model)
    • cbind(coef(logistic.model), coef(probit.model))
    • margins(logistic.model) # average marginal effect, on the probability scale
    • margins(probit.model) # directly comparable despite the different link; raw probit coefficients (z-scale) are harder to interpret than logit log-odds
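The two links can be compared directly: plogis() for the logit, pnorm() for the probit. With the coefficients each model reports (copied from the example code below), the implied probabilities are close over the observed range:

```r
# Fitted probabilities under each link, using each model's reported coefficients
p_logit  <- function(x) plogis(-3.572678 - 6.738768 * x)
p_probit <- function(x) pnorm(-1.954092 - 3.547546 * x)

x <- c(-0.5, -0.25, 0)
cbind(logit = p_logit(x), probit = p_probit(x))  # similar fits; they differ mainly in the tails
```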

Example code includes:

# Structure of choice.data
str(choice.data)
## 'data.frame':    2798 obs. of  13 variables:
##  $ OBS          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ HOUSEHOLDID  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LASTPURCHASE : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BUD          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ HOPPINESS    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PRICE.BUD    : num  0.052 0.052 0.046 0.052 0.046 ...
##  $ PRICE.HOP    : num  0.034 0.044 0.048 0.034 0.048 ...
##  $ DISPL.BUD    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DISPL.HOP    : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ FEAT.BUD     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ FEAT.HOP     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FEATDISPL.BUD: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FEATDISPL.HOP: int  0 0 0 0 0 0 1 0 0 0 ...
# Summarize purchases of HOPPINESS, BUD and PRICE.HOP and PRICE.BUD
summary(choice.data[,c("HOPPINESS", "BUD", "PRICE.HOP", "PRICE.BUD")]) 
##    HOPPINESS           BUD           PRICE.HOP         PRICE.BUD      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00300   Min.   :0.00100  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.03100   1st Qu.:0.03000  
##  Median :0.0000   Median :1.0000   Median :0.03400   Median :0.03400  
##  Mean   :0.1001   Mean   :0.8999   Mean   :0.03355   Mean   :0.03483  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.03600   3rd Qu.:0.04200  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.08700   Max.   :0.06100
# Plot HOPPINESS against PRICE.HOP
plot(HOPPINESS ~ PRICE.HOP, data = choice.data)

# Linear probability model explaining HOPPINESS by PRICE.HOP
probability.model <- lm(HOPPINESS ~ PRICE.HOP, data = choice.data)

# Add the model predictions
abline(coef(probability.model))

# Calculate the price ratio for HOPPINESS and BUD
choice.data$price.ratio <- log(choice.data$PRICE.HOP / choice.data$PRICE.BUD)

# Plot HOPPINESS purchases against the price ratio
plot(HOPPINESS ~ price.ratio, data = choice.data)

# Linear probability model explaining HOPPINESS by price.ratio
probability.model <- lm(HOPPINESS ~ price.ratio, data = choice.data)

# Add the model predictions
abline(probability.model)

# Logistic model explaining HOPPINESS by price.ratio
logistic.model <- glm(HOPPINESS ~ price.ratio, family = binomial, data = choice.data)

# Obtain the coefficients
coef(logistic.model)
## (Intercept) price.ratio 
##   -3.572678   -6.738768
# Plot HOPPINESS choices against price.ratio
plot(HOPPINESS ~ price.ratio, data = choice.data)

# Add the predictions of the logistic model
curve(predict(logistic.model, data.frame(price.ratio = x), type = "response"), add = TRUE)

# Linear probability model
coef(probability.model)
## (Intercept) price.ratio 
##  0.09700236 -0.29594939
# Logistic model
margins::margins(logistic.model)
## Average marginal effects
## glm(formula = HOPPINESS ~ price.ratio, family = binomial, data = choice.data)
##  price.ratio
##      -0.4585
# Sequence of x values
x <- seq(-1, 1, by = 0.10)

# Conditional effect plot
margins::cplot(logistic.model, "price.ratio", xvals = x)
##    xvals        yvals        upper        lower
## 1   -1.0 0.9595380563 0.9814489158 9.376272e-01
## 2   -0.9 0.9235941387 0.9580857059 8.891026e-01
## 3   -0.8 0.8603664046 0.9101362487 8.105966e-01
## 4   -0.7 0.7584975129 0.8210699539 6.959251e-01
## 5   -0.6 0.6155217264 0.6799292789 5.511142e-01
## 6   -0.5 0.4493508612 0.5011249497 3.975768e-01
## 7   -0.4 0.2937644409 0.3271033760 2.604255e-01
## 8   -0.3 0.1749350431 0.1954437612 1.544263e-01
## 9   -0.2 0.0975345104 0.1119672207 8.310180e-02
## 10  -0.1 0.0522128673 0.0628159496 4.160979e-02
## 11   0.0 0.0273135795 0.0346981044 1.992905e-02
## 12   0.1 0.0141114813 0.0189496002 9.273362e-03
## 13   0.2 0.0072431374 0.0102678379 4.218437e-03
## 14   0.3 0.0037051921 0.0055333871 1.876997e-03
## 15   0.4 0.0018920795 0.0029702318 8.139273e-04
## 16   0.5 0.0009653426 0.0015895470 3.411382e-04
## 17   0.6 0.0004922958 0.0008485618 1.360297e-04
## 18   0.7 0.0002509978 0.0004520395 4.995612e-05
## 19   0.8 0.0001279565 0.0002403579 1.555513e-05
## 20   0.9 0.0000652272 0.0001275873 2.867069e-06

# Probit model explaining HOPPINESS by price.ratio
probit.model <- glm(HOPPINESS ~ price.ratio, family = binomial(link=probit), data = choice.data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Obtain the coefficients
coef(probit.model)
## (Intercept) price.ratio 
##   -1.954092   -3.547546
# Compare the coefficients
cbind(coef(probit.model), coef(logistic.model))
##                  [,1]      [,2]
## (Intercept) -1.954092 -3.572678
## price.ratio -3.547546 -6.738768
# Logistic model
margins::margins(logistic.model)
## Average marginal effects
## glm(formula = HOPPINESS ~ price.ratio, family = binomial, data = choice.data)
##  price.ratio
##      -0.4585
# Probit model
margins::margins(probit.model)
## Average marginal effects
## glm(formula = HOPPINESS ~ price.ratio, family = binomial(link = probit),     data = choice.data)
##  price.ratio
##      -0.4503

Chapter 4 - Extended Demand Modeling

Model Selection:

  • Can extend the model to include display and feature promotion data
    • summary(choice.data[, c("FEAT.HOP", "DISPL.HOP", "FEATDISPL.HOP")])
    • extended.model <- glm(HOPPINESS ~ price.ratio + DISPL.HOP + FEAT.HOP + FEATDISPL.HOP, family = binomial, data = choice.data)
    • margins::margins(extended.model)
    • summary(extended.model)
  • The residual deviance can be compared against the null deviance of an intercept-only model using an ANOVA
    • null.model <- glm(HOPPINESS ~ 1, family = binomial, data = choice.data)
    • anova(extended.model, null.model, test = "Chisq")
  • Can also run a stepwise model and assess using AIC
    • final.model <- MASS::stepAIC(extended.model, direction = "backward", trace = FALSE)
    • summary(final.model)

Predictive Performance:

  • Can use cutoffs on the predicted probabilities to create a classifier
    • predicted <- ifelse(fitted.values(extended.model) >= 0.5, 1, 0)
    • observed <- choice.data$HOPPINESS
    • table(observed, predicted)/2798 # There are 2798 samples (confusion matrix)
    • Roc <- pROC::roc(predictor = fitted.values(extended.model), response = observed)
    • plot(Roc)

Model Validation:

  • Can evaluate the model on independent (unseen) datasets - example of subsetting on LASTPURCHASE
    • train.data <- subset(choice.data, subset = LASTPURCHASE == 0)
    • test.data <- subset(choice.data, subset = LASTPURCHASE == 1)
    • train.model <- glm(HOPPINESS ~ price.ratio + FEAT.HOP + FEATDISPL.HOP, family = binomial, data = train.data)
    • margins::margins(train.model)
    • prob <- predict(train.model, newdata = test.data, type = "response")
    • predicted <- ifelse(prob >= 0.5, 1, 0)
    • observed <- test.data$HOPPINESS
    • table(predicted, observed)/300

Wrap Up:

  • Linear and non-linear response models
  • Logit and probit models
  • Dummy and lagged variables
  • Test and control approaches

Example code includes:

# Summarizing DISPLAY.HOP, FEAT.HOP, FEATDISPL.HOP actions
summary(choice.data[, c("DISPL.HOP", "FEAT.HOP", "FEATDISPL.HOP")])
##    DISPL.HOP          FEAT.HOP       FEATDISPL.HOP     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.03538   Mean   :0.03645   Mean   :0.009292  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000
# Logistic model explaining HOPPINESS by price.ratio, DISPL.HOP, FEAT.HOP, FEATDISPL.HOP
extended.model <- glm(HOPPINESS ~ price.ratio + DISPL.HOP + FEAT.HOP + FEATDISPL.HOP, 
                      family = binomial, data  = choice.data
                      )

# Marginal effects for the extended logistic response model
margins::margins(extended.model)
## Average marginal effects
## glm(formula = HOPPINESS ~ price.ratio + DISPL.HOP + FEAT.HOP +     FEATDISPL.HOP, family = binomial, data = choice.data)
##  price.ratio DISPL.HOP FEAT.HOP FEATDISPL.HOP
##      -0.4471  0.009486  0.04973        0.1086
# Summarize the model
summary(extended.model)
## 
## Call:
## glm(formula = HOPPINESS ~ price.ratio + DISPL.HOP + FEAT.HOP + 
##     FEATDISPL.HOP, family = binomial, data = choice.data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.02676  -0.43078  -0.22935  -0.07645   3.10246  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.6250     0.1447 -25.053  < 2e-16 ***
## price.ratio    -6.6663     0.4127 -16.154  < 2e-16 ***
## DISPL.HOP       0.1415     0.2599   0.544 0.586204    
## FEAT.HOP        0.7415     0.3780   1.962 0.049806 *  
## FEATDISPL.HOP   1.6189     0.4789   3.381 0.000723 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1820.0  on 2797  degrees of freedom
## Residual deviance: 1275.8  on 2793  degrees of freedom
## AIC: 1285.8
## 
## Number of Fisher Scoring iterations: 7
# Null model explaining HOPPINESS by the intercept only
null.model <- glm(HOPPINESS ~ 1, family = binomial, data = choice.data)

# Compare null.model against extended.model
anova(extended.model, null.model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: HOPPINESS ~ price.ratio + DISPL.HOP + FEAT.HOP + FEATDISPL.HOP
## Model 2: HOPPINESS ~ 1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      2793     1275.8                          
## 2      2797     1820.0 -4  -544.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Backward elimination
final.model <- MASS::stepAIC(extended.model, direction = "backward", trace = FALSE)

# Summarize the final model
summary(final.model)
## 
## Call:
## glm(formula = HOPPINESS ~ price.ratio + FEAT.HOP + FEATDISPL.HOP, 
##     family = binomial, data = choice.data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.04400  -0.43243  -0.22914  -0.07575   3.10947  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.6268     0.1450 -25.018  < 2e-16 ***
## price.ratio    -6.7167     0.4033 -16.655  < 2e-16 ***
## FEAT.HOP        0.7327     0.3780   1.938 0.052577 .  
## FEATDISPL.HOP   1.6041     0.4789   3.349 0.000811 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1820.0  on 2797  degrees of freedom
## Residual deviance: 1276.1  on 2794  degrees of freedom
## AIC: 1284.1
## 
## Number of Fisher Scoring iterations: 7
# Classifying the predictions
predicted <- ifelse(fitted.values(extended.model) >= 0.5, 1, 0)

# Obtain the purchase predictions
table(predicted)
## predicted
##    0    1 
## 2703   95
# Obtain the observed purchases
observed <- choice.data$HOPPINESS

# Cross-tabulating the observed vs. the predicted purchases
table(predicted, observed)/2798
##          observed
## predicted          0          1
##         0 0.88849178 0.07755540
##         1 0.01143674 0.02251608
# Creating the Roc object
Roc <- pROC::roc(predictor = fitted.values(extended.model), response = observed)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve
pROC::plot.roc(Roc)

# Create the training dataset
train.data <- subset(choice.data, LASTPURCHASE == 0)

# Create the test dataset
test.data <- subset(choice.data, LASTPURCHASE == 1)


# Fit logistic response model to the training data set
train.model <- glm(HOPPINESS ~ price.ratio + FEAT.HOP + FEATDISPL.HOP, 
                   family = binomial, data = train.data
                   )


# Predict the purchase probabilities for test.data
prob <- predict(train.model, newdata=test.data, type = "response") 

# Classify the predictions
predicted <- ifelse(prob >= 0.5, 1, 0) 

# Obtain the observed purchases from test.data
observed <- test.data$HOPPINESS

# Cross-tabulate the predicted vs. the observed purchases
table(predicted, observed)/300
##          observed
## predicted           0           1
##         0 0.923333333 0.063333333
##         1 0.006666667 0.006666667

Time Series with data.table in R

Chapter 1 - Review of data.table

Introduction:

  • Data frames are a general-purpose data structure - a rectangular structure implemented as a list of equal-length column vectors, which, unlike matrix columns, may be of multiple data types
  • The data.table package is an extension of data.frame - more efficient memory usage and more expressive syntax
    • library(data.table)
    • someDT <- data.table(x = rnorm(100), y = rep(TRUE, 100))
    • str(someDT) # Classes 'data.table' and 'data.frame': 100 obs. of 2 variables:
  • Can select columns using .()
    • baseballDT[, .(timestamp, winning_team)]
  • Can use .SD for column selection
    • cols <- c("timestamp", "winning_team")
    • baseballDT[, .SD, .SDcols = cols]
  • Can use grep() to help with the matching
    • count_cols <- grep("COUNT$", names(baseballDT), value = TRUE)
    • countDT <- baseballDT[, .SD, .SDcols = count_cols]
  • Can simultaneously select rows and columns - column names are evaluated as objects inside the data.table environment
    • cols <- c("timestamp", "winning_team")
    • baseballDT[ which.max(timestamp), .SD, .SDcols = cols ]

Flexible Data Selection:

  • The get() function allows for evaluating a string as a column reference
    • locDT <- data.table( cities = c("Chicago", "Boston", "Milwaukee"), ppl_mil = c(2.7, 0.673, 0.595) )
    • city_col <- "cities"
    • locDT[, get(city_col)]
    • square_col <- function(DT, col_name){ return(DT[, get(col_name) ^ 2]) }
  • Wrapping the left-hand side of := in parentheses means that you are accessing a name that is external to the data.table
    • add_bil_ppl <- function(DT, new_name){ DT[, (new_name) := ppl_mil * 1000] }
    • add10 <- function(DT, cols){ for (col in cols){ new_name <- paste0(col, "_plus10") ; DT[, (new_name) := get(col) + 10] } }
    • add10(locDT, cols = "ppl_mil")
  • Can change names using setnames() - modifies the table in place (no copies)
    • setnames(locDT, old = "cities", new = "city_names")
    • tag_important_columns <- function(DT, cols){ setnames(DT, old = cols, new = paste0(cols, "_important")) }
    • tag_important_columns(locDT, "ppl_mil")

Executing Functions Inside data.tables:

  • Can evaluate functions inside the data.table operators
  • Functions in the "i" argument can be used to select rows, with booleans used for subsetting
    • stockDT <- data.table( close_date = seq.POSIXt(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-30"), length.out = 100), MSFT = runif(100, 70, 80), AAPL = runif(100, 140, 180) )
    • stockDT[close_date > max(close_date) - 60 * 60 * 8] # rows within 8 hours of the latest close_date
  • Functions in the "j" argument can be used to summarize data
    • cor(stockDT[, .SD, .SDcols = c("AAPL", "MSFT")])
    • corr_mat <- stockDT[, cor(.SD), .SDcols = c("AAPL", "MSFT")]
    • stockDT[, rand_noise := AAPL + rnorm(100)]
  • Can use the "by" argument to dynamically group data
    • stockDT[, hour_of_day := as.integer(strftime(close_date, "%H"))]
    • stockDT[, mean(AAPL), by = hour_of_day][order(hour_of_day)]
    • stockDT[, mean(AAPL), by = .( hour_of_day = as.integer(strftime(close_date, "%H")) )][order(hour_of_day)] # same as above, but as a 1-step process
  • Since a data.table is a list of columns, can use lapply or sapply on the data.table
    • Use lapply() if you want a data.table back
    • Use sapply() if you want a vector or list back
    • stockDT[, lapply(.SD, function(x){ mean(is.na(x)) })]
    • num_obs <- stockDT[, sapply(.SD, function(x){ sum(!is.na(x)) })]
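The i/j/by pattern above can be sketched end-to-end as a small runnable example; the table and values here are simulated for illustration (the course's stockDT is not reproduced exactly):

```r
library(data.table)

# Simulated hourly prices (hypothetical values)
set.seed(42)
stockDT <- data.table(
    close_date = seq.POSIXt(as.POSIXct("2017-01-01", tz = "UTC"),
                            as.POSIXct("2017-01-30", tz = "UTC"),
                            length.out = 100),
    AAPL = runif(100, 140, 180)
)

# i: keep only rows within 8 hours of the latest timestamp
recentDT <- stockDT[close_date > max(close_date) - 60 * 60 * 8]

# j + by: mean price per hour of day, computed and ordered in one chain
hourly_means <- stockDT[, .(mean_AAPL = mean(AAPL)),
                        by = .(hour_of_day = as.integer(strftime(close_date, "%H")))
                        ][order(hour_of_day)]

# lapply over .SD: fraction of missing values per column (a data.table back)
stockDT[, lapply(.SD, function(x){ mean(is.na(x)) })]
```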

Example code includes:

library(data.table)

diagnosticDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/diagnosticDT.feather"))
str(diagnosticDT)
## Classes 'data.table' and 'data.frame':   100 obs. of  4 variables:
##  $ timestamp     : POSIXct, format: "2018-01-01 00:00:00" "2018-01-01 00:00:36" ...
##  $ engine_speed  : num  4325 5255 4566 5317 2739 ...
##  $ engine_temp   : num  58.3 100.2 146.7 100.4 87.8 ...
##  $ system_voltage: num  9177 9785 8248 9027 9327 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Select system voltage directly
voltageDT <- diagnosticDT[, .(timestamp, system_voltage)]

# Select system voltage with .SD
voltageDT <- diagnosticDT[, .SD, .SDcols = c("timestamp", "system_voltage")]

# Select system voltage with .SD + a vector of names
voltage_cols <- c("timestamp", "system_voltage")
voltageDT <- diagnosticDT[, .SD, .SDcols = voltage_cols]

diagnosticDT[which.max(timestamp), .SD, .SDcols=c("timestamp", "system_voltage")]
##              timestamp system_voltage
## 1: 2018-01-01 01:00:00       8031.055
# Store the names of all columns starting with "engine_" in a vector
engine_cols <- grep(pattern = "engine_", x = names(diagnosticDT), value = TRUE)

# Use that vector to create a new data.table with only engine signals
engineDT <- diagnosticDT[, .SD, .SDcols = engine_cols]


# Complete the function
add_interaction <- function(someDT, col1, col2){
    new_col_name <- paste0(col1, "_times_", col2)
    someDT[, (new_col_name) := get(col1) * get(col2)]
}

# Add an interaction
add_interaction(diagnosticDT, "engine_speed", "engine_temp")

# Check it out!
head(diagnosticDT)
##              timestamp engine_speed engine_temp system_voltage
## 1: 2018-01-01 00:00:00    4325.4234     58.2805       9177.185
## 2: 2018-01-01 00:00:36    5254.6392    100.2430       9785.184
## 3: 2018-01-01 00:01:12    4565.8940    146.6556       8247.590
## 4: 2018-01-01 00:01:49    5316.7474    100.3676       9026.618
## 5: 2018-01-01 00:02:25    2738.8858     87.7780       9327.280
## 6: 2018-01-01 00:03:01     998.2307    137.3255       9531.084
##    engine_speed_times_engine_temp
## 1:                       252087.8
## 2:                       526740.7
## 3:                       669614.1
## 4:                       533629.0
## 5:                       240413.9
## 6:                       137082.5
# Write a function to scale a column by 10
scale_by_10 <- function(someDT, col_to_scale, new_col_name){
    someDT[, (new_col_name) := get(col_to_scale) * 10]
}

# Try it out
scale_by_10(diagnosticDT, "engine_temp", "temp10")

# Check the state of the data.table
head(diagnosticDT)
##              timestamp engine_speed engine_temp system_voltage
## 1: 2018-01-01 00:00:00    4325.4234     58.2805       9177.185
## 2: 2018-01-01 00:00:36    5254.6392    100.2430       9785.184
## 3: 2018-01-01 00:01:12    4565.8940    146.6556       8247.590
## 4: 2018-01-01 00:01:49    5316.7474    100.3676       9026.618
## 5: 2018-01-01 00:02:25    2738.8858     87.7780       9327.280
## 6: 2018-01-01 00:03:01     998.2307    137.3255       9531.084
##    engine_speed_times_engine_temp   temp10
## 1:                       252087.8  582.805
## 2:                       526740.7 1002.430
## 3:                       669614.1 1466.556
## 4:                       533629.0 1003.676
## 5:                       240413.9  877.780
## 6:                       137082.5 1373.255
# Write a function that squares every numeric column
add_square_features <- function(someDT, cols){
    for (col_name in cols){
        new_col_name <- paste0(col_name, "_squared")
        someDT[, (new_col_name) := get(col_name)^2 ]
    }
}

# Look at the difference!
add_square_features(diagnosticDT, c("engine_speed", "engine_temp", "system_voltage"))
head(diagnosticDT)
##              timestamp engine_speed engine_temp system_voltage
## 1: 2018-01-01 00:00:00    4325.4234     58.2805       9177.185
## 2: 2018-01-01 00:00:36    5254.6392    100.2430       9785.184
## 3: 2018-01-01 00:01:12    4565.8940    146.6556       8247.590
## 4: 2018-01-01 00:01:49    5316.7474    100.3676       9026.618
## 5: 2018-01-01 00:02:25    2738.8858     87.7780       9327.280
## 6: 2018-01-01 00:03:01     998.2307    137.3255       9531.084
##    engine_speed_times_engine_temp   temp10 engine_speed_squared
## 1:                       252087.8  582.805           18709287.4
## 2:                       526740.7 1002.430           27611232.7
## 3:                       669614.1 1466.556           20847387.7
## 4:                       533629.0 1003.676           28267802.9
## 5:                       240413.9  877.780            7501495.2
## 6:                       137082.5 1373.255             996464.6
##    engine_temp_squared system_voltage_squared
## 1:            3396.617               84220718
## 2:           10048.654               95749817
## 3:           21507.876               68022738
## 4:           10073.648               81479833
## 5:            7704.976               86998160
## 6:           18858.291               90841562
# Change names
setnames(diagnosticDT, old = c("timestamp"), new = "obs_time")

# Tag all the numeric columns with "_NUMERIC"
tag_numeric_cols <- function(DT, cols){
    setnames(DT, old = cols, new = paste0(cols, "_NUMERIC"))
}

# Tag numeric columns
tag_numeric_cols(diagnosticDT, c("engine_speed", "engine_temp", "system_voltage"))


diagnosticDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/diagnosticDT.feather"))
str(diagnosticDT)
## Classes 'data.table' and 'data.frame':   100 obs. of  4 variables:
##  $ timestamp     : POSIXct, format: "2018-01-01 00:00:00" "2018-01-01 00:00:36" ...
##  $ engine_speed  : num  4325 5255 4566 5317 2739 ...
##  $ engine_temp   : num  58.3 100.2 146.7 100.4 87.8 ...
##  $ system_voltage: num  9177 9785 8248 9027 9327 ...
##  - attr(*, ".internal.selfref")=<externalptr>
diagnosticDT2 <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/diagnosticDT.feather"))
str(diagnosticDT2)
## Classes 'data.table' and 'data.frame':   100 obs. of  4 variables:
##  $ timestamp     : POSIXct, format: "2018-01-01 00:00:00" "2018-01-01 00:00:36" ...
##  $ engine_speed  : num  4325 5255 4566 5317 2739 ...
##  $ engine_temp   : num  58.3 100.2 146.7 100.4 87.8 ...
##  $ system_voltage: num  9177 9785 8248 9027 9327 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Mean of engine temp
diagnosticDT[, mean(engine_temp)]
## [1] 94.49417
# Correlation between engine_temp and system_voltage
diagnosticDT[, cor(engine_temp, system_voltage)]
## [1] 0.06351799
# Function to get the correlation matrix of the numeric columns
correlations <- function(DT){
    # Find numeric columns
    num_cols <- DT[, sapply(.SD, is.numeric)]
    numeric_cols <- names(DT)[num_cols]
    return(DT[, cor(.SD), .SDcols = numeric_cols])
}


# Mean of system voltage
diagnosticDT[, lapply(.SD, mean), .SDcols = c("system_voltage")]
##    system_voltage
## 1:        9030.28
# Mean of all engine cols
engine_cols <- c("engine_speed", "engine_temp")
meanDT <- diagnosticDT[, lapply(.SD, mean), .SDcols = engine_cols]
print(meanDT)
##    engine_speed engine_temp
## 1:     3056.395    94.49417
get_numeric_cols <- function(DT){
    num_cols <- DT[, sapply(.SD, is.numeric)]
    return(names(DT)[num_cols])
}

# Function to get correlation matrix from a data.table
corrmat_from_dt <- function(DT){
    numeric_cols <- get_numeric_cols(DT)
    return(DT[, cor(.SD), .SDcols = numeric_cols])
}

# Get correlation matrices
corrmat_from_dt(diagnosticDT)
##                engine_speed  engine_temp system_voltage
## engine_speed   1.0000000000 0.0001975067     0.16182688
## engine_temp    0.0001975067 1.0000000000     0.06351799
## system_voltage 0.1618268766 0.0635179886     1.00000000
corrmat_from_dt(diagnosticDT2)
##                engine_speed  engine_temp system_voltage
## engine_speed   1.0000000000 0.0001975067     0.16182688
## engine_temp    0.0001975067 1.0000000000     0.06351799
## system_voltage 0.1618268766 0.0635179886     1.00000000

Chapter 2 - Getting Time Series Data into data.table

Overview of the POSIXct Type:

  • The name POSIX stands for "Portable Operating System Interface"
    • POSIXlt = a list object with date-time components like year and day stored in individual attributes
    • POSIXct = a numeric count of seconds since 1970-01-01, with a single attribute capturing the timezone
    • POSIXct is generally preferred in databases, since it is easiest to sort and manipulate
    • This course always uses UTC (constant time zone)
  • Can convert other formats to POSIXct
    • as.POSIXct("2004-10-27", tz = "UTC")
    • as.POSIXct(1540153601, origin = "1970-01-01", tz = "UTC") # if the integer is the number of seconds since 1970-01-01
    • as.POSIXct(as.Date(42885, origin = "1900-01-01"), tz = "UTC") # if the integer is the number of days since 1900-01-01 (e.g. an Excel serial date)
  • as.POSIXct() is vectorized
    • dates <- c("2004-10-24", "2004-10-25", "2004-10-26")
    • as.integer(as.POSIXct(dates, tz = "UTC"))
    • someDT <- data.table(dates = c("2004-10-24", "2004-10-25", "2004-10-26"))
    • someDT[, posix := as.POSIXct(dates, tz = "UTC")]
  • Example of converting columns in a dataset without overwriting the original data
    • gameDT <- data.table( game_date = c("2004-10-23", "2004-10-24", "2004-10-26", "2004-10-27") )
    • gameDT[, posix_date := as.POSIXct(game_date, tz = "UTC")]
  • The lubridate family of functions can be applied also
    • the_date <- "10-27-2004 22:29:00"
    • lubridate::mdy_hms(the_date)

Creating data.tables from vectors:

  • Can create data.tables from vectors similar to the process for data.frames
    • candyDT <- data.table( color = c("red", "blue", "green"), size = c("S", "L", "S"), num = c(100, 50, 210) )
    • testDT <- data.table( rand_numbers = rnorm(100), rand_strings = sample(LETTERS, size = 100, replace = TRUE), simple_index = 1:100, sample_dates = seq.POSIXt( from = as.POSIXct("1990-01-01"), to = as.POSIXct("1992-08-01"), length.out = 100), fifty_fifty_split = c(rep(TRUE, 50), rep(FALSE, 50)) )
  • seq.POSIXt() is an extension of seq() for time data
    • start <- as.POSIXct("2010-06-17", tz = "UTC")
    • end <- as.POSIXct("2010-06-18", tz = "UTC")
    • hourlyDT <- data.table( timestamp = seq.POSIXt(start, end, length.out = 1 + 24) )
    • minuteDT <- data.table( timestamp = seq.POSIXt(start, end, length.out = 1 + 24 * 60) )
  • Can use .N to dynamically size the input vector
    • add_stock_data <- function(DT){ DT[, COMPANY1 := rnorm(n = .N)] ; DT[, COMPANY2 := rnorm(n = .N)] }

Coercing from xts:

  • The xts format is popular, and can be used for conversions to data.table
    • dates <- seq.POSIXt(from = as.POSIXct("2017-06-15"), to = as.POSIXct("2017-06-16"), length.out = 24)
    • ex_tee_ess <- xts::xts( x = rnorm(24), order.by = dates )
  • The xts object has several attributes
    • tclass = R class for the date-time index
    • tzone = timezone for the date-time index
    • attr(ex_tee_ess, "tclass")
    • attr(ex_tee_ess, "tzone")
  • The expressive subsetting is one of the reasons for xts's popularity
    • ["/"] = "the whole dataset"
    • ["2017"] = "data from 2017"
    • ["2017-01/"] = "data from January 2017 to the end of the data"
    • ["2014/2015"] = "data from 2014 to 2015"
  • Can also use functions for converting the units of time
    • to.minutes(), to.minutes10(), to.daily()
    • xts::to.daily(hourlyXTS)
  • Can convert to data.table using as.data.table() - may need to modify column names to be more meaningful
    • hourlyDT <- data.table::as.data.table( hourlyXTS )
    • data.table::setnames(hourlyDT, "V1", "stock_price")

Combining datasets with merge() and rbindlist():

  • Merging with timestamps depends on numeric precision, since POSIXct is a number
  • Precision-safe merges can be managed using the round() function
    • secDT[, timestamp := as.POSIXct(round(as.numeric(timestamp)), origin = "1970-01-01")]
    • milliDT[, timestamp := as.POSIXct(round(as.numeric(timestamp)), origin = "1970-01-01")]
    • merge(secDT, milliDT, by = "timestamp", all = TRUE)
  • May need to run a down-sampling process to match up records appropriately
    • salesDT[, .(ts, year = year(ts), mday = mday(ts), hour = hour(ts))]
    • dailySalesDT[, day_int := mday(timestamp)]
    • dailyPriceDT <- hourlyPriceDT[, .(price = mean(price)), by = .(day = mday(timestamp))]
    • mergeDT <- merge( dailySalesDT, dailyPriceDT, by.x = "day_int", by.y = "day" )
  • Can also use rbindlist() to stack data.tables - but be careful about matching data types and names
    • allDT <- rbindlist(list(DT1, DT2, DT3), fill = TRUE)
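The precision-safe merge idea above can be shown with two small simulated tables (the names secDT and milliDT follow the bullets; the values are hypothetical):

```r
library(data.table)

# Simulated second- and millisecond-resolution tables
base <- as.numeric(as.POSIXct("2017-06-01 00:00:00", tz = "UTC"))
secDT   <- data.table(timestamp = as.POSIXct(base + 0:4, origin = "1970-01-01", tz = "UTC"),
                      sales = c(10, 12, 9, 14, 11))
milliDT <- data.table(timestamp = as.POSIXct(base + 0:4 + 0.001, origin = "1970-01-01", tz = "UTC"),
                      price = c(1.1, 1.2, 1.0, 1.3, 1.2))

# Without rounding, the millisecond offsets prevent the keys from matching;
# rounding both sides to whole seconds makes the merge precision-safe
secDT[,   timestamp := as.POSIXct(round(as.numeric(timestamp)), origin = "1970-01-01", tz = "UTC")]
milliDT[, timestamp := as.POSIXct(round(as.numeric(timestamp)), origin = "1970-01-01", tz = "UTC")]
mergedDT <- merge(secDT, milliDT, by = "timestamp", all = TRUE)

# Stacking instead of joining: fill = TRUE pads the non-shared columns with NA
stackedDT <- rbindlist(list(secDT, milliDT), fill = TRUE)
```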

Example code includes:

excelDT <- data.table(timecol=42885:42889, sales=c(105, 92, 500, 81, 230))
stringDT <- data.table(timecol=c("2017-06-01", "2017-06-02", "2017-06-03", "2017-06-04", "2017-06-05"),
                       sales=c(105, 92, 500, 81, 230)
                       )
epochSecondsDT <- data.table(timecol=1496275200 + 24*60*60*0:4, sales=c(105, 92, 500, 81, 230))
epochMillisDT <- data.table(timecol=1000 * c(1496275200 + 24*60*60*0:4), sales=c(105, 92, 500, 81, 230))


# Create POSIXct dates from a hypothetical Excel dataset
excelDT[, posix := as.POSIXct(as.Date(timecol, origin = "1900-01-01"), tz = "UTC")]

# Convert strings to POSIXct
stringDT[, posix := as.POSIXct(timecol, tz = "UTC")]

# Convert epoch seconds to POSIXct
epochSecondsDT[, posix := as.POSIXct(timecol, tz = "UTC", origin = "1970-01-01")]

# Convert epoch milliseconds to POSIXct
epochMillisDT[, posix := as.POSIXct(timecol/1000, origin = "1970-01-01", tz="UTC")]


stringDT <- data.table(timecol1=c("2017-06-01 10", "2017-06-02 5", "2017-06-03 10", 
                                  "2017-06-04 7", "2017-06-05 9"
                                  ),
                       timecol2=c("06-01-2017 10:00:00", "06-02-2017 05:00:00", "06-03-2017 10:00:00", 
                                  "06-05-2017 07:00:00", "06-04-2017 09:00:00"
                                  ),
                       sales=c(105, 92, 500, 81, 230)
                       )
stringDT
##         timecol1            timecol2 sales
## 1: 2017-06-01 10 06-01-2017 10:00:00   105
## 2:  2017-06-02 5 06-02-2017 05:00:00    92
## 3: 2017-06-03 10 06-03-2017 10:00:00   500
## 4:  2017-06-04 7 06-05-2017 07:00:00    81
## 5:  2017-06-05 9 06-04-2017 09:00:00   230
# Convert timecol1
str(stringDT)
## Classes 'data.table' and 'data.frame':   5 obs. of  3 variables:
##  $ timecol1: chr  "2017-06-01 10" "2017-06-02 5" "2017-06-03 10" "2017-06-04 7" ...
##  $ timecol2: chr  "06-01-2017 10:00:00" "06-02-2017 05:00:00" "06-03-2017 10:00:00" "06-05-2017 07:00:00" ...
##  $ sales   : num  105 92 500 81 230
##  - attr(*, ".internal.selfref")=<externalptr>
stringDT[, posix1 := lubridate::ymd_h(timecol1)]
str(stringDT)
## Classes 'data.table' and 'data.frame':   5 obs. of  4 variables:
##  $ timecol1: chr  "2017-06-01 10" "2017-06-02 5" "2017-06-03 10" "2017-06-04 7" ...
##  $ timecol2: chr  "06-01-2017 10:00:00" "06-02-2017 05:00:00" "06-03-2017 10:00:00" "06-05-2017 07:00:00" ...
##  $ sales   : num  105 92 500 81 230
##  $ posix1  : POSIXct, format: "2017-06-01 10:00:00" "2017-06-02 05:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Convert timecol2
str(stringDT)
## Classes 'data.table' and 'data.frame':   5 obs. of  4 variables:
##  $ timecol1: chr  "2017-06-01 10" "2017-06-02 5" "2017-06-03 10" "2017-06-04 7" ...
##  $ timecol2: chr  "06-01-2017 10:00:00" "06-02-2017 05:00:00" "06-03-2017 10:00:00" "06-05-2017 07:00:00" ...
##  $ sales   : num  105 92 500 81 230
##  $ posix1  : POSIXct, format: "2017-06-01 10:00:00" "2017-06-02 05:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>
stringDT[, posix2 := lubridate::mdy_hms(timecol2)]
str(stringDT)
## Classes 'data.table' and 'data.frame':   5 obs. of  5 variables:
##  $ timecol1: chr  "2017-06-01 10" "2017-06-02 5" "2017-06-03 10" "2017-06-04 7" ...
##  $ timecol2: chr  "06-01-2017 10:00:00" "06-02-2017 05:00:00" "06-03-2017 10:00:00" "06-05-2017 07:00:00" ...
##  $ sales   : num  105 92 500 81 230
##  $ posix1  : POSIXct, format: "2017-06-01 10:00:00" "2017-06-02 05:00:00" ...
##  $ posix2  : POSIXct, format: "2017-06-01 10:00:00" "2017-06-02 05:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Generate a series of dates
march_dates <- seq.POSIXt(as.POSIXct("2017-03-01", tz="UTC"), 
                          as.POSIXct("2017-03-31", tz="UTC"), 
                          length.out = 31
                          )

# Generate hourly data
hourly_times <- seq.POSIXt(as.POSIXct("2017-05-01 00:00:00", tz="UTC"), 
                           as.POSIXct("2017-05-02 00:00:00", tz="UTC"), 
                           length.out = 1 + 24
                           )

# Generate sample IoT data
iotDT <- data.table(timestamp = seq.POSIXt(as.POSIXct("2016-04-19 00:00:00", tz="UTC"), 
                                           as.POSIXct("2016-04-20 00:00:00", tz="UTC"), 
                                           length.out = 1 + 24
                                           ),
                    engine_temp = rnorm(n=1+24),
                    ambient_temp = rnorm(n=1+24)
                    )
head(iotDT)
##              timestamp  engine_temp ambient_temp
## 1: 2016-04-19 00:00:00  0.070400856    1.3290296
## 2: 2016-04-19 01:00:00 -0.692487882   -0.1275896
## 3: 2016-04-19 02:00:00  0.523005842    1.1565826
## 4: 2016-04-19 03:00:00 -1.316436702    0.4565383
## 5: 2016-04-19 04:00:00 -1.516521174    0.3326634
## 6: 2016-04-19 05:00:00 -0.006312561   -1.5452930
# Create a 500-row data.table
start_date <- "2016-01-01"
end_date <- "2018-01-01"
someDT <- data.table(timestamp = seq.POSIXt(as.POSIXct(start_date), 
                                            as.POSIXct(end_date), 
                                            length.out = 500
                                            )
                     )

# Function to add random columns
add_random_cols <- function(DT, colnames){
    for (colname in colnames){
        DT[, (colname) := rnorm(n = .N)]
    }
}

# Check out the new data.table
add_random_cols(someDT, c("copper", "chopper", "stopper"))


# Simulated data
some_data <- rnorm(100)
some_dates <- seq.POSIXt(from = as.POSIXct("2017-06-15 00:00:00Z", tz = "UTC"),
                         to = as.POSIXct("2017-06-15 01:00:00Z", tz = "UTC"),
                         length.out = 100
                         )

# Make your own 'xts' object
myXTS <- xts::xts(some_data, order.by=some_dates)

# View the timezone
print(attr(myXTS, "tzone"))
## [1] "UTC"
nickelXTS <- readRDS("./RInputFiles/nickelXTS.rds")

# All observations after 2018-01-01 00:45:00
fifteenXTS <- nickelXTS["2018-01-01 00:45:00/"]

# Check the structure
str(fifteenXTS)
## An 'xts' object on 2018-01-01 00:45:27/2018-01-01 01:00:00 containing:
##   Data: num [1:25, 1] 14764 13465 13095 14700 13332 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "nickel_price"
##   Indexed by objects of class: [POSIXct,POSIXt] TZ: UTC
##   xts Attributes:  
##  NULL
# 10-minute aggregations
tenMinuteXTS <- xts::to.minutes10(nickelXTS)
print(tenMinuteXTS)
## Warning: timezone of object (UTC) is different than current timezone ().
##                     nickelXTS.Open nickelXTS.High nickelXTS.Low
## 2018-01-01 00:09:41       14986.29       14986.29      13263.81
## 2018-01-01 00:19:23       14682.81       14700.31      13253.49
## 2018-01-01 00:29:41       13561.37       14975.33      13131.90
## 2018-01-01 00:39:23       14295.35       14956.31      13388.33
## 2018-01-01 00:49:41       13426.89       14982.86      13095.23
## 2018-01-01 00:59:23       13659.71       14828.94      13104.92
## 2018-01-01 01:00:00       14780.57       14780.57      14780.57
##                     nickelXTS.Close
## 2018-01-01 00:09:41        14642.82
## 2018-01-01 00:19:23        14070.30
## 2018-01-01 00:29:41        13879.25
## 2018-01-01 00:39:23        14935.83
## 2018-01-01 00:49:41        14023.63
## 2018-01-01 00:59:23        14828.94
## 2018-01-01 01:00:00        14780.57
# 1-minute aggregations
oneMinuteXTS <- xts::to.minutes(nickelXTS)


# Convert to a data.table
nickelDT <- as.data.table(nickelXTS)
str(nickelDT)
## Classes 'data.table' and 'data.frame':   100 obs. of  2 variables:
##  $ index       : POSIXct, format: "2018-01-01 00:00:00" "2018-01-01 00:00:36" ...
##  $ nickel_price: num  14986 14132 14353 14798 14303 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Change names
setnames(nickelDT, old="index", new="spot_price_timestamp")
print(nickelDT)
##      spot_price_timestamp nickel_price
##   1:  2018-01-01 00:00:00     14986.29
##   2:  2018-01-01 00:00:36     14131.57
##   3:  2018-01-01 00:01:12     14352.72
##   4:  2018-01-01 00:01:49     14797.59
##   5:  2018-01-01 00:02:25     14303.15
##   6:  2018-01-01 00:03:01     13692.15
##   7:  2018-01-01 00:03:38     14265.23
##   8:  2018-01-01 00:04:14     13322.13
##   9:  2018-01-01 00:04:50     14149.04
##  10:  2018-01-01 00:05:27     13263.81
##  11:  2018-01-01 00:06:03     14038.89
##  12:  2018-01-01 00:06:40     13380.40
##  13:  2018-01-01 00:07:16     13968.63
##  14:  2018-01-01 00:07:52     13671.97
##  15:  2018-01-01 00:08:29     13620.77
##  16:  2018-01-01 00:09:05     14815.78
##  17:  2018-01-01 00:09:41     14642.82
##  18:  2018-01-01 00:10:18     14682.81
##  19:  2018-01-01 00:10:54     13846.74
##  20:  2018-01-01 00:11:30     13954.12
##  21:  2018-01-01 00:12:07     13953.37
##  22:  2018-01-01 00:12:43     13785.52
##  23:  2018-01-01 00:13:20     14042.39
##  24:  2018-01-01 00:13:56     14700.31
##  25:  2018-01-01 00:14:32     13978.34
##  26:  2018-01-01 00:15:09     13940.36
##  27:  2018-01-01 00:15:45     14144.94
##  28:  2018-01-01 00:16:21     13390.14
##  29:  2018-01-01 00:16:58     13253.49
##  30:  2018-01-01 00:17:34     14510.67
##  31:  2018-01-01 00:18:10     13581.22
##  32:  2018-01-01 00:18:47     14518.28
##  33:  2018-01-01 00:19:23     14070.30
##  34:  2018-01-01 00:20:00     13561.37
##  35:  2018-01-01 00:20:36     13831.70
##  36:  2018-01-01 00:21:12     13131.90
##  37:  2018-01-01 00:21:49     13224.83
##  38:  2018-01-01 00:22:25     14394.62
##  39:  2018-01-01 00:23:01     13763.71
##  40:  2018-01-01 00:23:38     14624.69
##  41:  2018-01-01 00:24:14     14785.23
##  42:  2018-01-01 00:24:50     13140.53
##  43:  2018-01-01 00:25:27     14030.86
##  44:  2018-01-01 00:26:03     13503.92
##  45:  2018-01-01 00:26:40     14388.48
##  46:  2018-01-01 00:27:16     14975.33
##  47:  2018-01-01 00:27:52     14130.33
##  48:  2018-01-01 00:28:29     13278.99
##  49:  2018-01-01 00:29:05     14868.12
##  50:  2018-01-01 00:29:41     13879.25
##  51:  2018-01-01 00:30:18     14295.35
##  52:  2018-01-01 00:30:54     14944.84
##  53:  2018-01-01 00:31:30     14739.03
##  54:  2018-01-01 00:32:07     14859.15
##  55:  2018-01-01 00:32:43     14365.30
##  56:  2018-01-01 00:33:20     14123.30
##  57:  2018-01-01 00:33:56     13838.13
##  58:  2018-01-01 00:34:32     14772.30
##  59:  2018-01-01 00:35:09     14956.31
##  60:  2018-01-01 00:35:45     14342.06
##  61:  2018-01-01 00:36:21     14886.71
##  62:  2018-01-01 00:36:58     14384.66
##  63:  2018-01-01 00:37:34     13605.72
##  64:  2018-01-01 00:38:10     13388.33
##  65:  2018-01-01 00:38:47     13991.27
##  66:  2018-01-01 00:39:23     14935.83
##  67:  2018-01-01 00:40:00     13426.89
##  68:  2018-01-01 00:40:36     14425.18
##  69:  2018-01-01 00:41:12     14982.86
##  70:  2018-01-01 00:41:49     14979.89
##  71:  2018-01-01 00:42:25     13266.64
##  72:  2018-01-01 00:43:01     13128.31
##  73:  2018-01-01 00:43:38     13227.89
##  74:  2018-01-01 00:44:14     14874.00
##  75:  2018-01-01 00:44:50     14266.51
##  76:  2018-01-01 00:45:27     14763.78
##  77:  2018-01-01 00:46:03     13464.51
##  78:  2018-01-01 00:46:40     13095.23
##  79:  2018-01-01 00:47:16     14700.34
##  80:  2018-01-01 00:47:52     13331.94
##  81:  2018-01-01 00:48:29     13134.86
##  82:  2018-01-01 00:49:05     13367.25
##  83:  2018-01-01 00:49:41     14023.63
##  84:  2018-01-01 00:50:18     13659.71
##  85:  2018-01-01 00:50:54     13414.85
##  86:  2018-01-01 00:51:30     13193.81
##  87:  2018-01-01 00:52:07     13104.92
##  88:  2018-01-01 00:52:43     13188.39
##  89:  2018-01-01 00:53:20     14038.85
##  90:  2018-01-01 00:53:56     14149.33
##  91:  2018-01-01 00:54:32     13628.37
##  92:  2018-01-01 00:55:09     13676.95
##  93:  2018-01-01 00:55:45     14365.72
##  94:  2018-01-01 00:56:21     13696.46
##  95:  2018-01-01 00:56:58     14249.07
##  96:  2018-01-01 00:57:34     14666.98
##  97:  2018-01-01 00:58:10     14298.75
##  98:  2018-01-01 00:58:47     14419.84
##  99:  2018-01-01 00:59:23     14828.94
## 100:  2018-01-01 01:00:00     14780.57
##      spot_price_timestamp nickel_price
treasuryDT <- data.table(timestamp=as.POSIXct("2018-03-01 00:00:00", tz="UTC") + 0:4 + 0.001, 
                         treasury_10y=c(0.71, 0.8, 0.78, 0.77, 0.73)
                         )
oilDT <- data.table(timestamp=as.POSIXct("2018-03-01 00:00:00", tz="UTC") + 0:4, 
                    oil=c(44.07, 44.15, 44.14, 44.06, 44.09)
                    )

# Naive approach (merge on timestamp)
newDT <- merge(treasuryDT, oilDT, by = "timestamp")
str(newDT)
## Classes 'data.table' and 'data.frame':   0 obs. of  3 variables:
##  $ timestamp   : 'POSIXct' num(0) 
##  - attr(*, "tzone")= chr "UTC"
##  $ treasury_10y: num 
##  $ oil         : num 
##  - attr(*, ".internal.selfref")=<externalptr>
# Check out the precision
treasuryDT[, as.numeric(timestamp)]
## [1] 1519862400 1519862401 1519862402 1519862403 1519862404
oilDT[, as.numeric(timestamp)]
## [1] 1519862400 1519862401 1519862402 1519862403 1519862404
# Clean up and merge
treasuryDT[, timestamp := as.POSIXct(round(as.numeric(timestamp)), origin = "1970-01-01")]
newDT <- merge(treasuryDT, oilDT, by = "timestamp")
str(newDT)
## Classes 'data.table' and 'data.frame':   5 obs. of  3 variables:
##  $ timestamp   : POSIXct, format: "2018-02-28 18:00:00" "2018-02-28 18:00:01" ...
##  $ treasury_10y: num  0.71 0.8 0.78 0.77 0.73
##  $ oil         : num  44.1 44.1 44.1 44.1 44.1
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "timestamp"
# Add grouping indicator 
# fxDT[, yearmonth := paste0(year(timestamp), "_", month(timestamp))]
# exportDT[, yearmonth := paste0(year(timestamp), "_", month(timestamp))]

# Monthly exchange rate
# monthlyFXDT <- fxDT[, .(exch_rate = mean(exchange_rate)), by = yearmonth]

# Merge 
# merge(exportDT, monthlyFXDT, by="yearmonth")

Chapter 3 - Generating Lags, Differences, and Windowed Aggregations

Generating Lags:

  • The lag represents the value of a time series n time periods ago (common for forecasting)
    • dailyDT[, lag15 := shift(sales, type = "lag", n = 15)]
  • The shift() function in data.table can either lag or lead, filling with NA where values cannot be known
    • someDT[, col1_lag1 := shift(col1, n = 1, type = "lag")]
    • someDT[, col1_lead1 := shift(col1, n = 1, type = "lead")]
  • Need to ensure the proper sort order prior to lag/lead; the functions are time-naïve
    • setorderv(backwardsDT, "timestamp") # operation is run in-place
    • backwardsDT[, somenums_lag1 := shift(somenums, type = "lag", n = 1)]
  • Can also generate lags on the fly while modeling
    • mod <- lm(sales ~ shift(sales, n = 21), data = dailyDT)
  • With long datasets, may need to add a "by" so that the lagging is done by subject
    • experimentDT[, lag1 := shift(result, type = "lag", n = 1), by = subject_id]
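The on-the-fly lag inside lm() mentioned above is not shown in the example code below, so here is a minimal sketch using simulated data (dailyDT here is made up for illustration and is not the DataCamp dataset):

```r
library(data.table)

set.seed(42)
dailyDT <- data.table(day = 1:60, sales = cumsum(rnorm(60, mean = 1)))

# shift() can be called directly inside the model formula...
mod_fly <- lm(sales ~ shift(sales, n = 1), data = dailyDT)

# ...which fits the same model as a precomputed lag column
dailyDT[, sales_lag1 := shift(sales, type = "lag", n = 1)]
mod_col <- lm(sales ~ sales_lag1, data = dailyDT)

all.equal(unname(coef(mod_fly)), unname(coef(mod_col)))  # TRUE
```

The precomputed-column route is usually preferable when the same lag is reused across several models, since the lag is calculated once.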

Generating Growth Rates and Differences:

  • A common goal is a stationary time series - one with a constant mean and variance over time
  • Typically, differencing can turn a trending time series into a stationary series of differences
    • gdpDT[, lag1 := shift(gdp, type = "lag", n = 1)]
    • gdpDT[, diff1 := gdp - lag1]
    • gdpDT[, diff1 := gdp - shift(gdp, type = "lag", n = 1)] # same as above, but in a single shot
  • May instead want to capture the pace of change, such as a growth rate
    • gdpDT[, growth1 := (gdp - shift(gdp, type = "lag", n = 1)) / shift(gdp, type = "lag", n = 1) ]
    • gdpDT[, growth1 := (gdp / shift(gdp, type = "lag", n = 1)) - 1 ] # algebraic simplification

Windowing with j and by:

  • Windowing is the process of evaluating a metric over a time period - e.g., heartbeats per minute
    • salesDT[, nearest_month := month(timestamp)] # create a grouping column
    • aggDT <- salesDT[, .( min = min(sales), total = sum(sales), num_obs = length(sales) ), by = nearest_month ]
  • Can also do on-the-fly grouping by passing an expression to the by-clause
    • aggDT <- malfunctionDT[, .( min = min(sales), total = sum(sales), num_obs = length(sales) ), by = month(timestamp) ]
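The example code below builds a helper grouping column first; the on-the-fly variant (an expression passed straight to `by`) can be sketched as follows, with hourlyDT simulated for illustration:

```r
library(data.table)

set.seed(1)
hourlyDT <- data.table(timestamp = as.POSIXct("2018-01-01", tz = "UTC") + 3600 * 0:(24 * 90 - 1),
                       sales = rnorm(24 * 90, mean = 500, sd = 40))

# No helper column needed: month(timestamp) is evaluated per row and used as the group
aggDT <- hourlyDT[, .(min = min(sales), total = sum(sales), num_obs = .N),
                  by = month(timestamp)]
print(aggDT)
```

Note that `.N` is the idiomatic data.table row count for a group, equivalent to `length(sales)` in the notes above.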

Example code includes:

dailyDT <- data.table(timestamp=as.POSIXct("2018-08-01", tz="UTC") + lubridate::days(0:152), 
                      sales=c(483.08, 449.25, 523.6, 498.36, 448, 487.03, 502.91, 475.69, 535.39, 471.54, 494.57, 509.6, 538.43, 603.55, 560.84, 482.39, 456.42, 550.68, 526.83, 577.16, 515.7, 450.44, 522.18, 546.44, 530.86, 452.47, 498.56, 486.58, 523.58, 424.25, 587.53, 533.11, 477.74, 582.16, 449.59, 575.78, 523.92, 475.5, 556.5, 487.27, 515.98, 523.78, 528.1, 548.19, 484.26, 542.97, 540.72, 475.16, 483.19, 598.89, 419.74, 448.57, 494.05, 438.82, 460.1, 343.01, 525.2, 527.51, 461.07, 557.52, 577.24, 499.41, 431.83, 487.61, 412.54, 454.56, 471.44, 520.61, 519.03, 547.59, 541.78, 507.67, 448.96, 468.08, 494.02, 520.78, 442.87, 507.98, 553.78, 486.46, 476.9, 546.92, 502.69, 557.93, 445.11, 501.94, 491.04, 534.49, 533.16, 543.76, 484.38, 610.28, 528.18, 483.56, 509.4, 496.62, 439.98, 488.11, 475.01, 514.07, 567.83, 506.74, 496.28, 417.83, 499.35, 556.75, 511, 596.06, 537.87, 562.97, 496.55, 499.85, 460.23, 478.96, 451.44, 576.34, 466.04, 433.66, 530.26, 554.76, 469.11, 477.79, 542.85, 582.55, 464.29, 458.92, 585.33, 487.18, 576.68, 488.35, 441.12, 509.81, 464.99, 464, 506.53, 459.36, 554.13, 444, 436.21, 528.15, 480.87, 541.93, 496.3, 423.1, 546.8, 499.21, 543.36, 534.85, 523.89, 524.99, 522.67, 524.51, 502.6)
                      )
str(dailyDT)
## Classes 'data.table' and 'data.frame':   153 obs. of  2 variables:
##  $ timestamp: POSIXct, format: "2018-08-01" "2018-08-02" ...
##  $ sales    : num  483 449 524 498 448 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Sort by time
setorderv(dailyDT, "timestamp")

# 1-day lag
dailyDT[, sales_lag1 := shift(sales, type = "lag", n = 1)]

# 5-day lag
dailyDT[, sales_lag5 := shift(sales, type = "lag", n = 5)]


experimentDT <- data.table(day=c(1:3, 1:3), 
                           result=c(1, 3.3, 2.5, 1.1, 3.9, 3.8), 
                           subject_id=LETTERS[c(1, 1, 1, 2, 2, 2)]
                           )
experimentDT
##    day result subject_id
## 1:   1    1.0          A
## 2:   2    3.3          A
## 3:   3    2.5          A
## 4:   1    1.1          B
## 5:   2    3.9          B
## 6:   3    3.8          B
# Yesterday
experimentDT[, yesterday := shift(result, type="lag", n=1), by=subject_id]

# Two days ago
experimentDT[, two_days_ago := shift(result, type="lag", n=2), by=subject_id]

# Preview experimentDT
print(experimentDT)
##    day result subject_id yesterday two_days_ago
## 1:   1    1.0          A        NA           NA
## 2:   2    3.3          A       1.0           NA
## 3:   3    2.5          A       3.3          1.0
## 4:   1    1.1          B        NA           NA
## 5:   2    3.9          B       1.1           NA
## 6:   3    3.8          B       3.9          1.1
aluminumDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/aluminumDF.feather") %>%
                                rename(timestamp=Date, price=`Cash Buyer`) %>%
                                select(timestamp, price)
                            )
str(aluminumDT)
## Classes 'data.table' and 'data.frame':   1552 obs. of  2 variables:
##  $ timestamp: Date, format: "2018-03-12" "2018-03-09" ...
##  $ price    : num  2096 2078 2082 2112 2136 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Add 1-period and 2-period lags
aluminumDT[, lag1 := shift(price, type = "lag", n = 1)]
aluminumDT[, lag2 := shift(price, type = "lag", n = 2)]

# Fit models with 1 and 2 lags
mod1 <- lm(price ~ lag1, data = aluminumDT)
mod2 <- lm(price ~ lag1 + lag2, data = aluminumDT)

# Compare
stargazer::stargazer(list(mod1, mod2), type = "text")
## 
## ==============================================================================
##                                        Dependent variable:                    
##                     ----------------------------------------------------------
##                                               price                           
##                                  (1)                          (2)             
## ------------------------------------------------------------------------------
## lag1                          0.994***                      1.013***          
##                                (0.003)                      (0.025)           
##                                                                               
## lag2                                                         -0.019           
##                                                             (0.025)           
##                                                                               
## Constant                      10.691**                      10.753**          
##                                (4.934)                      (4.944)           
##                                                                               
## ------------------------------------------------------------------------------
## Observations                    1,551                        1,550            
## R2                              0.989                        0.989            
## Adjusted R2                     0.989                        0.989            
## Residual Std. Error      20.854 (df = 1549)            20.860 (df = 1547)     
## F Statistic         138,844.900*** (df = 1; 1549) 69,320.790*** (df = 2; 1547)
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01
# One-period lag
dailyDT[, sales_lag1 := shift(sales, type = "lag", n = 1)]

# One-period diff
dailyDT[, sales_diff1 := sales - sales_lag1]

# Two-period diff
dailyDT[, sales_diff2 := sales - shift(sales, type="lag", n=2)]


# Add 1-period percentage change
dailyDT[, sales_pctchng1 := sales_diff1 / sales_lag1]

# Add 2-period percentage change
dailyDT[, sales_pctchng2 := (sales / shift(sales, type="lag", n=2) - 1)]


passengerDT <- data.table(obs_time=lubridate::ymd_hms("2017-08-01 00:00:00") + lubridate::minutes(15*0:96), 
                          passengers=c(506, 513, 554, 427, 439, 476, 509, 382, 457, 498, 398, 385, 529, 442, 393, 500, 557, 439, 453, 488, 520, 546, 542, 492, 528, 493, 498, 530, 515, 537, 535, 518, 396, 623, 499, 467, 523, 499, 535, 383, 546, 552, 436, 556, 452, 512, 514, 476, 437, 432, 522, 492, 537, 480, 543, 485, 491, 512, 555, 498, 452, 502, 514, 452, 446, 458, 538, 414, 499, 433, 503, 466, 553, 473, 473, 546, 447, 545, 492, 554, 466, 618, 530, 568, 541, 433, 524, 433, 571, 506, 485, 466, 490, 467, 528, 427, 480)
                          )
str(passengerDT)
## Classes 'data.table' and 'data.frame':   97 obs. of  2 variables:
##  $ obs_time  : POSIXct, format: "2017-08-01 00:00:00" "2017-08-01 00:15:00" ...
##  $ passengers: num  506 513 554 427 439 476 509 382 457 498 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Generate the observation time in seconds
passengerDT[, obs_time_in_seconds := as.numeric(obs_time)]

# Add floor time for each time stamp
seconds_in_an_hour <- 60 * 60
passengerDT[, hour_end := floor(obs_time_in_seconds / seconds_in_an_hour)]

# Count number of observations in each hour
passengerDT[, .N, by = hour_end]
##     hour_end N
##  1:   417096 4
##  2:   417097 4
##  3:   417098 4
##  4:   417099 4
##  5:   417100 4
##  6:   417101 4
##  7:   417102 4
##  8:   417103 4
##  9:   417104 4
## 10:   417105 4
## 11:   417106 4
## 12:   417107 4
## 13:   417108 4
## 14:   417109 4
## 15:   417110 4
## 16:   417111 4
## 17:   417112 4
## 18:   417113 4
## 19:   417114 4
## 20:   417115 4
## 21:   417116 4
## 22:   417117 4
## 23:   417118 4
## 24:   417119 4
## 25:   417120 1
##     hour_end N
# Mean passengers per hour
passengerDT[, mean(passengers), by=hour_end]
##     hour_end     V1
##  1:   417096 500.00
##  2:   417097 451.50
##  3:   417098 434.50
##  4:   417099 466.00
##  5:   417100 484.25
##  6:   417101 525.00
##  7:   417102 512.25
##  8:   417103 526.25
##  9:   417104 496.25
## 10:   417105 485.00
## 11:   417106 522.50
## 12:   417107 488.50
## 13:   417108 470.75
## 14:   417109 511.25
## 15:   417110 514.00
## 16:   417111 480.00
## 17:   417112 464.00
## 18:   417113 475.25
## 19:   417114 511.25
## 20:   417115 509.50
## 21:   417116 545.50
## 22:   417117 482.75
## 23:   417118 507.00
## 24:   417119 478.00
## 25:   417120 480.00
##     hour_end     V1
# Cleaner names
passengerDT[, .(mean_passengers = mean(passengers)), by=hour_end]
##     hour_end mean_passengers
##  1:   417096          500.00
##  2:   417097          451.50
##  3:   417098          434.50
##  4:   417099          466.00
##  5:   417100          484.25
##  6:   417101          525.00
##  7:   417102          512.25
##  8:   417103          526.25
##  9:   417104          496.25
## 10:   417105          485.00
## 11:   417106          522.50
## 12:   417107          488.50
## 13:   417108          470.75
## 14:   417109          511.25
## 15:   417110          514.00
## 16:   417111          480.00
## 17:   417112          464.00
## 18:   417113          475.25
## 19:   417114          511.25
## 20:   417115          509.50
## 21:   417116          545.50
## 22:   417117          482.75
## 23:   417118          507.00
## 24:   417119          478.00
## 25:   417120          480.00
##     hour_end mean_passengers
# Generate hourly summary statistics
passengerDT[, .(min_passengers = min(passengers), max_passengers = max(passengers)), by=hour_end]
##     hour_end min_passengers max_passengers
##  1:   417096            427            554
##  2:   417097            382            509
##  3:   417098            385            498
##  4:   417099            393            529
##  5:   417100            439            557
##  6:   417101            492            546
##  7:   417102            493            530
##  8:   417103            515            537
##  9:   417104            396            623
## 10:   417105            383            535
## 11:   417106            436            556
## 12:   417107            452            514
## 13:   417108            432            522
## 14:   417109            480            543
## 15:   417110            491            555
## 16:   417111            452            514
## 17:   417112            414            538
## 18:   417113            433            503
## 19:   417114            473            553
## 20:   417115            447            554
## 21:   417116            466            618
## 22:   417117            433            541
## 23:   417118            466            571
## 24:   417119            427            528
## 25:   417120            480            480
##     hour_end min_passengers max_passengers

Chapter 4 - Case Study: Financial Data

Modeling Metals Prices:

  • Data sets have been pulled from Quandl; can then convert to data.table, including fixing variable/column names
    • aluminumDF <- Quandl::Quandl(code = "LME/PR_AL", start_date = "2001-12-31", end_date = "2018-03-12")
    • aluminumDT <- as.data.table(aluminumDF)
    • newDT <- aluminumDT[, .(obstime = Date, aluminum_price = `Cash Seller & Settlement`)] # The .() allows for both selecting by name and changing names
  • Can also apply functions using .()
    • newDT <- aluminumDT[, .(obstime = as.POSIXct(Date, tz = "UTC"), aluminum_price = `Cash Seller & Settlement`)]
  • Can also apply merges based on timestamps
    • mergedDT <- merge( x = aluminumDT, y = nickelDT, all = TRUE, by = "obstime" )
  • Can also use Reduce() to fold a binary function across a list
    • Reduce( f = function(x, y){ paste0(x, y, "|") }, x = c("a", "b", "c") )
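The toy Reduce() call above is worth tracing once, since the same left-fold pattern drives the multi-table merge in the example code further down:

```r
# Reduce() folds the binary function across the vector, left to right:
#   f("a", "b")   -> "ab|"
#   f("ab|", "c") -> "ab|c|"
Reduce(f = function(x, y){ paste0(x, y, "|") }, x = c("a", "b", "c"))
## [1] "ab|c|"

# The same pattern chains pairwise merges over a list of data.tables, e.g.:
# Reduce(function(x, y) merge(x, y, by = "obstime", all = TRUE), list_of_DTs)
```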

Time Series Feature Engineering:

  • May want to add differences and growth rates to the data
    • gdpDT[, diff1 := gdp - shift(gdp, type = "lag", n = 1)]
    • add_diffs <- function(DT){ DT[, diff1 := gdp - shift(gdp, type = "lag", n = 1)] ; return(invisible(NULL)) } # data.tables are passed by reference, so DT is modified in place
  • Can use parentheses for a user-defined name in the data.table
    • colname <- "abc"
    • someDT[, (colname) := rnorm(10)]
    • add_diffs <- function(DT, newcol){ DT[, (newcol) := gdp - shift(gdp, type = "lag", n = 1)] ; return(invisible(NULL)) }
    • add_diffs(DT, "diff1")
  • Can also use get() to have a flexible column for changing
    • colname <- "def"
    • someDT[, random_stuff := get(colname) * rnorm(10)]
    • add_diffs <- function(DT, newcol, dcol){ DT[, (newcol) := get(dcol) - shift(get(dcol), type = "lag", n = 1)] ; return(invisible(NULL)) }
    • add_diffs <- function(DT, newcol, dcol, ndiff){ DT[, (newcol) := get(dcol) - shift(get(dcol), type = "lag", n = ndiff)] ; return(invisible(NULL)) } # allows for passing the number of time periods
  • Can also extend the methodology to growth rates
    • add_growth_rates <- function(DT, newcol, dcol, ndiff){ DT[, (newcol) := (get(dcol) / shift(get(dcol), type = "lag", n = ndiff)) - 1 ] ; return(invisible(NULL)) }

EDA and Model Building:

  • Feature selection is sometimes needed for modeling - for example, regressions may perform poorly with too many features
  • Can look at the correlations using data.table
    • someDT <- data.table(x = rnorm(100), y = rnorm(100), z = rnorm(100))
    • someDT[complete.cases(someDT)]
    • cor(someDT)
    • cmat <- cor(someDT[complete.cases(someDT)])
    • cmat[, "x"]
    • feat_cols <- c("var_1", "var_5")
    • mod1 <- lm(target ~ ., data = trainDT[, .SD, .SDcols = c("target", feat_cols)]) # include the target so the formula can find it

Wrap Up:

  • Modify data.tables by reference
  • Growth rates and differences
  • Flexible code for changes in variables

Example code includes:

copperDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/copperDF.feather"))
str(copperDT)
## Classes 'data.table' and 'data.frame':   1546 obs. of  13 variables:
##  $ Date                    : Date, format: "2018-03-12" "2018-03-09" ...
##  $ Cash Buyer              : num  6856 6806 6828 6872 6968 ...
##  $ Cash Seller & Settlement: num  6857 6808 6830 6873 6968 ...
##  $ 3-months Buyer          : num  6895 6838 6855 6915 7003 ...
##  $ 3-months Seller         : num  6900 6839 6860 6916 7004 ...
##  $ 15-months Buyer         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 15-months Seller        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 1 Buyer             : num  7020 6965 6975 7025 7100 ...
##  $ Dec 1 Seller            : num  7030 6975 6985 7035 7110 ...
##  $ Dec 2 Buyer             : num  7025 6975 6975 7020 7095 ...
##  $ Dec 2 Seller            : num  7035 6985 6985 7030 7105 ...
##  $ Dec 3 Buyer             : num  7005 6955 6955 7000 7075 ...
##  $ Dec 3 Seller            : num  7015 6965 6965 7010 7085 ...
##  - attr(*, ".internal.selfref")=<externalptr>
nickelDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/nickelDF.feather"))
str(nickelDT)
## Classes 'data.table' and 'data.frame':   1552 obs. of  13 variables:
##  $ Date                    : Date, format: "2018-03-12" "2018-03-09" ...
##  $ Cash Buyer              : num  13720 13335 13240 13350 13575 ...
##  $ Cash Seller & Settlement: num  13725 13345 13250 13370 13580 ...
##  $ 3-months Buyer          : num  13750 13385 13295 13400 13575 ...
##  $ 3-months Seller         : num  13800 13390 13300 13420 13580 ...
##  $ 15-months Buyer         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 15-months Seller        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 1 Buyer             : num  14035 13655 13570 13670 13840 ...
##  $ Dec 1 Seller            : num  14085 13705 13620 13720 13890 ...
##  $ Dec 2 Buyer             : num  14190 13810 13730 13830 14000 ...
##  $ Dec 2 Seller            : num  14240 13860 13780 13880 14050 ...
##  $ Dec 3 Buyer             : num  14315 13935 13855 13955 14125 ...
##  $ Dec 3 Seller            : num  14365 13985 13905 14005 14175 ...
##  - attr(*, ".internal.selfref")=<externalptr>
cobaltDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/cobaltDF.feather"))
str(cobaltDT)
## Classes 'data.table' and 'data.frame':   1551 obs. of  7 variables:
##  $ Date                    : Date, format: "2018-03-12" "2018-03-09" ...
##  $ Cash Buyer              : num  18000 17500 17000 16500 16000 15500 15500 15500 15500 15500 ...
##  $ Cash Seller & Settlement: num  18500 19500 19000 18500 18000 16000 16000 16000 16000 16000 ...
##  $ 3-months Buyer          : num  18000 17500 17000 16500 16000 15500 15500 15500 15500 15500 ...
##  $ 3-months Seller         : num  18500 19500 19000 18500 18000 16000 16000 16000 16000 16000 ...
##  $ 15-months Buyer         : num  18235 18490 17990 17500 17000 ...
##  $ 15-months Seller        : num  19235 19490 18990 18500 18000 ...
##  - attr(*, ".internal.selfref")=<externalptr>
tinDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/tinDF.feather"))
str(tinDT)
## Classes 'data.table' and 'data.frame':   1552 obs. of  13 variables:
##  $ Date                    : Date, format: "2018-03-12" "2018-03-09" ...
##  $ Cash Buyer              : num  21475 21325 21625 21490 21585 ...
##  $ Cash Seller & Settlement: num  21500 21375 21650 21495 21595 ...
##  $ 3-months Buyer          : num  21355 21255 21525 21335 21450 ...
##  $ 3-months Seller         : num  21360 21260 21550 21340 21455 ...
##  $ 15-months Buyer         : num  21050 20960 21240 21030 21145 ...
##  $ 15-months Seller        : num  21100 21010 21290 21080 21195 ...
##  $ Dec 1 Buyer             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 1 Seller            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 2 Buyer             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 2 Seller            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 3 Buyer             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 3 Seller            : num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
aluminumDT <- as.data.table(feather::read_feather("./RInputFiles/Feather_Data/aluminumDF.feather"))
str(aluminumDT)
## Classes 'data.table' and 'data.frame':   1552 obs. of  13 variables:
##  $ Date                    : Date, format: "2018-03-12" "2018-03-09" ...
##  $ Cash Buyer              : num  2096 2078 2082 2112 2136 ...
##  $ Cash Seller & Settlement: num  2097 2078 2082 2112 2136 ...
##  $ 3-months Buyer          : num  2117 2098 2104 2132 2154 ...
##  $ 3-months Seller         : num  2118 2099 2104 2132 2155 ...
##  $ 15-months Buyer         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 15-months Seller        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dec 1 Buyer             : num  2168 2148 2150 2177 2192 ...
##  $ Dec 1 Seller            : num  2173 2153 2155 2182 2197 ...
##  $ Dec 2 Buyer             : num  2188 2168 2172 2195 2210 ...
##  $ Dec 2 Seller            : num  2193 2173 2177 2200 2215 ...
##  $ Dec 3 Buyer             : num  2208 2188 2192 2215 2230 ...
##  $ Dec 3 Seller            : num  2213 2193 2197 2220 2235 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Rename "Cash Buyer" to "copper_price"
setnames(copperDT, old="Cash Buyer", new="copper_price")
setnames(cobaltDT, old="Cash Buyer", new="cobalt_price")
setnames(tinDT, old="Cash Buyer", new="tin_price")
setnames(aluminumDT, old="Cash Buyer", new="aluminum_price")

# Convert "Date" to POSIXct
copperDT[, close_date := as.POSIXct(Date, tz="UTC")]
cobaltDT[, close_date := as.POSIXct(Date, tz="UTC")]
tinDT[, close_date := as.POSIXct(Date, tz="UTC")]
aluminumDT[, close_date := as.POSIXct(Date, tz="UTC")]

# Create copperDT2 with "close_date" and "copper_price"
copperDT2 <- copperDT[, .(close_date, copper_price)]
cobaltDT2 <- cobaltDT[, .(close_date, cobalt_price)]
tinDT2 <- tinDT[, .(close_date, tin_price)]
aluminumDT2 <- aluminumDT[, .(close_date, aluminum_price)]

# Create a new data.table using .() subsetting
nickelDT2 <- nickelDT[, .(
    close_date = as.POSIXct(Date, tz = "UTC"),
    nickel_price = `Cash Buyer`
)]
str(nickelDT2)
## Classes 'data.table' and 'data.frame':   1552 obs. of  2 variables:
##  $ close_date  : POSIXct, format: "2018-03-11 19:00:00" "2018-03-08 18:00:00" ...
##  $ nickel_price: num  13720 13335 13240 13350 13575 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Merge copperDT and cobaltDT with merge()
mergedDT <- merge(cobaltDT2, copperDT2, by="close_date", all=TRUE)

# Merge five tables into one
mergedDT <- Reduce(f = function(x, y) { merge(x, y, by="close_date", all=TRUE) },
                   x = list(aluminumDT2, copperDT2, cobaltDT2, nickelDT2, tinDT2)
                   )


# Function to add differences
add_diffs <- function(DT, cols, ndiff){
    for (colname in cols){
        new_name <- paste0(colname, "_diff", ndiff)
        DT[, (new_name) := get(colname) - shift(get(colname), type = "lag", n = ndiff)]
    }
}

# Add 2-period diffs
add_diffs(mergedDT, paste0(c("aluminum", "cobalt", "copper", "nickel", "tin"), "_price"), 2)


# Function to add growth rates
add_growth_rates <- function(DT, cols, ndiff){
    for (colname in cols){
        new_name <- paste0(colname, "_pctchg", ndiff)
        DT[, (new_name) := (get(colname) / shift(get(colname), type = "lag", n = ndiff)) - 1]
    }
}

# Add 1-period growth rate
add_growth_rates(mergedDT, paste0(c("aluminum", "cobalt", "copper", "nickel", "tin"), "_price"), 1)


# Function to get correlation matrix from a data.table
corrmat_from_dt <- function(DT, cols){
    # Subset to the requested columns
    subDT <- DT[, .SD, .SDcols=cols]
    subDT <- subDT[complete.cases(subDT)]
    return(cor(subDT))
}

# Get correlations of prices
corrmat_from_dt(mergedDT, paste0(c("aluminum", "cobalt", "copper", "nickel", "tin"), "_price"))
##                aluminum_price cobalt_price copper_price nickel_price
## aluminum_price      1.0000000    0.5325989    0.7538137    0.5834202
## cobalt_price        0.5325989    1.0000000    0.8101197    0.9201932
## copper_price        0.7538137    0.8101197    1.0000000    0.8442838
## nickel_price        0.5834202    0.9201932    0.8442838    1.0000000
## tin_price           0.6123461    0.6821939    0.7297888    0.6538355
##                tin_price
## aluminum_price 0.6123461
## cobalt_price   0.6821939
## copper_price   0.7297888
## nickel_price   0.6538355
## tin_price      1.0000000
# Add 1-period first differences
price_cols <- c("aluminum_price", "cobalt_price", "copper_price", "nickel_price", "tin_price")
add_diffs(DT = mergedDT, cols = price_cols, ndiff = 1)

# Rename aluminum first difference to "target"
setnames(mergedDT, "aluminum_price_diff1", "target")

# Add 1-period growth rates
add_growth_rates(DT = mergedDT, cols = price_cols, ndiff = 1)

# Correlation matrix
diff_cols <- grep("_diff", x = names(mergedDT), value = TRUE)
growth_cols <- grep("_pctchg", x = names(mergedDT), value = TRUE)
corrmat_from_dt(DT = mergedDT, cols = c(diff_cols, growth_cols, "target"))[, "target"]
##   aluminum_price_diff2     cobalt_price_diff2     copper_price_diff2 
##            0.712321999            0.005595258            0.396601022 
##     nickel_price_diff2        tin_price_diff2     cobalt_price_diff1 
##            0.323119963            0.248714962            0.001459352 
##     copper_price_diff1     nickel_price_diff1        tin_price_diff1 
##            0.579055815            0.462822212            0.377950436 
## aluminum_price_pctchg1   cobalt_price_pctchg1   copper_price_pctchg1 
##            0.993585007           -0.012106589            0.562282577 
##   nickel_price_pctchg1      tin_price_pctchg1                 target 
##            0.449428280            0.359260474            1.000000000
# Add 1-period differences
add_diffs(mergedDT, paste0(c("cobalt", "copper", "nickel", "tin"), "_price"), 1)

# Add 3-period growth rates
add_growth_rates(mergedDT, paste0(c("cobalt", "copper", "nickel", "tin"), "_price"), 3)

# Add 12-period difference in nickel price
add_diffs(mergedDT, paste0(c("nickel"), "_price"), 12)

# Top 4 difference / growth columns
top_features <- c("copper_price_diff1", "nickel_price_diff1", "tin_price_diff1", "copper_price_pctchg3")
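The shift-based arithmetic inside add_diffs() and add_growth_rates() can be sanity-checked on a toy vector in base R (no data.table required):

```r
# A 10%-per-period price series: the growth rate should be a constant 0.1
price <- c(100, 110, 121, 133.1)
lag1 <- c(NA, head(price, -1))   # base-R stand-in for data.table::shift(type = "lag")
diff1 <- price - lag1            # 1-period first difference
pctchg1 <- price / lag1 - 1      # 1-period growth rate
round(pctchg1, 4)  # NA 0.1 0.1 0.1
```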

Interactive Data Visualization with plotly in R

Chapter 1 - Introduction to plotly

What is plotly?

  • Interface to the plotly javascript library - plots can work in multiple formats
    • Active development and support community
  • Need to consider the relative values of the static graphic and the interactive graphic
  • Example of the wine plot using ggplot2
    • static <- wine %>% ggplot(aes(x = Flavanoids, y = Proline, color = Type)) + geom_point()
  • Can use ggplotly() to convert a static object to an interactive plot
    • plotly::ggplotly(static)

Univariate Graphics:

  • May want to look at distributions of a categorical variable - example using the wine dataset
    • wine %>%
    • count(Type) %>% # create a frequency table
    • plot_ly(x = ~Type, y = ~n) %>% # specify aesthetics (similar to ggplot aes() function)
    • add_bars() # add the bars trace
    • wine %>% count(Type) %>% mutate(Type = forcats::fct_reorder(Type, n, .desc = TRUE)) %>% plot_ly(x = ~Type, y = ~n) %>% add_bars() # example of using forcats for reordering
  • Can instead run a histogram using plotly
    • wine %>%
    • plot_ly(x = ~Phenols) %>% # specify aesthetics
    • add_histogram() # add the histogram trace
    • wine %>% plot_ly(x = ~Phenols) %>% add_histogram(nbinsx = 10) # adjust the number of bins to a precise number
    • wine %>% plot_ly(x = ~Phenols) %>% add_histogram(xbins = list(start = 0.8, end = 4, size = 0.25)) # provide specific bin ranges
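The fct_reorder() step above simply sorts factor levels by a summary of another variable, so the bars plot in order; a small sketch (assuming forcats is installed, as loaded via tidyverse earlier):

```r
library(forcats)

# Reorder the levels by their n values, descending, so bars plot largest-first
f <- fct_reorder(factor(c("a", "b", "c")), c(2, 9, 5), .desc = TRUE)
levels(f)  # "b" "c" "a"
```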

Bivariate Graphics:

  • Can extend to looking at scatterplots, boxplots, and the like
  • Example scatterplot for exploring relationships in numeric variables
    • winequality %>% plot_ly(x = ~residual_sugar, y = ~fixed_acidity) %>% add_markers()
  • Example stacked bars
    • winequality %>% count(type, quality_label) %>% plot_ly(x = ~type, y = ~n, color = ~quality_label) %>% add_bars() %>% layout(barmode = "stack")
  • Converting to proportional bars
    • winequality %>%
    • count(type, quality_label) %>%
    • group_by(type) %>% # group the table
    • mutate(prop = n / sum(n)) %>% # calculate the proportions
    • plot_ly(x = ~type, y = ~prop, color = ~quality_label) %>% # plot the proportions rather than the counts
    • add_bars() %>%
    • layout(barmode = "stack")
  • Can also run boxplots to explore numeric distributions vs. a categorical variable
    • winequality %>% plot_ly(x = ~quality_label, y = ~alcohol) %>% add_boxplot()
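The group_by()/mutate() proportion step can be mimicked in base R with ave(); a toy frequency table for illustration:

```r
# Within each type, the proportions should sum to 1
df <- data.frame(type = c("red", "red", "white", "white"),
                 quality_label = c("low", "high", "low", "high"),
                 n = c(30, 70, 40, 60))
df$prop <- ave(df$n, df$type, FUN = function(v) v / sum(v))
df$prop  # 0.30 0.70 0.40 0.60
```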

Example code includes:

vgsales <- readr::read_csv("./RInputFiles/vgsales.csv")
glimpse(vgsales)


# Store the scatterplot of Critic_Score vs. NA_Sales sales in 2016
scatter <- vgsales %>%
  filter(Year == 2016) %>%
  ggplot(aes(x = NA_Sales, y = Critic_Score)) +
  geom_point(alpha = 0.3)

# Convert the scatterplot to a plotly graphic
plotly::ggplotly(scatter)


library(plotly)

# Create a histogram of Critic_Score
vgsales %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram()

# Create a histogram of Critic_Score with at most 25 bins
vgsales %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram(nbinsx = 25)

# Create a histogram with bins of width 10 between 0 and 100
vgsales %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram(xbins = list(start=0, end=100, size=10))


# Create a frequency for Genre
genre_table <- vgsales %>%
    count(Genre)

# Reorder the bars for Genre by n
genre_table %>%
    filter(!is.na(Genre)) %>%
    mutate(Genre = fct_reorder(Genre, n, .desc=TRUE)) %>%
    plot_ly(x = ~Genre, y = ~n) %>% 
    add_bars()


# Create a scatter plot of User_Score against Critic_Score
vgsales %>% 
    filter(!is.na(Critic_Score) & !is.na(User_Score)) %>%
    plot_ly(x=~Critic_Score, y=~User_Score) %>%
    add_markers()


# Filter out the 2016 video games
vg2016 <- vgsales %>%
    filter(Year == 2016)

# Create a stacked bar chart of Rating by Genre
vg2016 %>%
    count(Genre, Rating) %>%
    plot_ly(x = ~Genre, y = ~n, color = ~Rating) %>%
    add_bars() %>%
    layout(barmode = "stack")

# Create boxplots of Global_Sales by Genre for above data
vg2016 %>% 
  plot_ly(x=~Global_Sales, y=~Genre) %>%
  add_boxplot()

Chapter 2 - Styling and Customizing Graphics

Customize Traces:

  • Can change colors; for example, non-blue histograms
    • winequality %>% plot_ly(x = ~fixed_acidity) %>% add_histogram()
    • winequality %>% plot_ly(x = ~fixed_acidity) %>% add_histogram(color = I("red")) # The I() function is for as-is, since plotly otherwise assumes mapping of a color
  • Can change opacity, for example in scatterplots with over-plotting
    • winequality %>% plot_ly(x = ~residual_sugar, y = ~fixed_acidity) %>% add_markers()
    • winequality %>% plot_ly(x = ~residual_sugar, y = ~fixed_acidity) %>% add_markers(marker = list(opacity = 0.2))
  • Can change the plotting symbol from filled to open
    • winequality %>% plot_ly(x = ~residual_sugar, y = ~fixed_acidity) %>% add_markers(marker = list(symbol = "circle-open"))
  • Additional marker options include
    • opacity
    • color
    • symbol (scatter/box)
    • size (scatter)
    • width (bar/histogram)

Thoughtful Use of Color:

  • Color can be used thoughtfully to represent variables in a dataset
  • Coloring by category can make the trends more evident
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol, color = ~Type) %>% add_markers()
  • Coloring can add a third quantitative variable to the plot (gradient)
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol, color = ~Color) %>% add_markers()
  • Can also add the RColorBrewer palettes or use a custom color palette
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol, color = ~Type) %>% add_markers(colors = "Dark2")
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol, color = ~Type) %>% add_markers(colors = c("orange", "black", "skyblue")) # note that as-is I() is not needed here, since color=~Type already requested the aesthetic

Hover Info:

  • Hover information is added in plotly automatically - for scatterplots it shows coordinate pairs, but without variable names
  • Can change the defaults to hovering to make the charts more intuitive
    • wine %>%
    • count(Type) %>%
    • plot_ly(x = ~Type, y = ~n, hoverinfo = "y") %>%
    • add_bars()
  • Can customize hover to add variable names to a scatter plot
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol, hoverinfo = "text", text = ~paste("Flavanoids:", Flavanoids, "<br>", "Alcohol:", Alcohol)) %>% add_markers() # uses the html "<br>" line break, and the ~ means columns are mapped as aesthetics
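The custom hover text is just a pasted string with html line breaks; for example (toy values):

```r
# plotly renders "<br>" inside hover text as a line break
hover_text <- paste("Flavanoids:", 2.29, "<br>", "Alcohol:", 13.2)
hover_text  # "Flavanoids: 2.29 <br> Alcohol: 13.2"
```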

Customizing Layout:

  • The layout() function is the workhorse for many of the desired changes to plotly - like labs() and theme() in ggplot2
  • Example for adding axis titles (via a list per axis) and a plot title (as a string)
    • winequality %>% plot_ly(x = ~free_so2, y = ~total_so2) %>% add_markers(marker = list(opacity = 0.2)) %>% layout(xaxis = list(title = "Free SO2 (ppm)"), yaxis = list(title = "Total SO2 (ppm)"), title = "Does free SO2 predict total SO2 in wine?")
    • winequality %>% plot_ly(x = ~free_so2, y = ~total_so2) %>% add_markers(marker = list(opacity = 0.2)) %>% layout(xaxis = list(title = "Free SO2 (ppm, log scale)", type = "log"), yaxis = list(title = "Total SO2 (ppm, log scale)", type = "log"), title = "Does free SO2 predict total SO2 in wine?")
  • Can remove the zero lines and grids from the plot
    • winequality %>% plot_ly(x = ~free_so2, y = ~total_so2) %>% add_markers(marker = list(opacity = 0.5)) %>% layout(xaxis = list(title = "Free SO2 (ppm)", zeroline = FALSE), yaxis = list(title = "Total SO2 (ppm)", zeroline = FALSE, showgrid = FALSE))
  • Can make additional customizations to the plotting canvas - paper_bgcolor = toRGB() and plot_bgcolor = toRGB()

Example code includes:

# Filter out the 2016 video games
vgsales2016 <- vgsales %>%
    filter(Year == 2016)
str(vgsales2016)


# Create a histogram of Critic_Score with navy bars that are 50% transparent
vgsales2016 %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram(color = I("navy"), opacity = 0.5)


# Change the color of the histogram using a hex code
vgsales2016 %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram(color=I("#111e6c"))

# Change the color of the histogram using rgb()
vgsales2016 %>%
    filter(!is.na(Critic_Score)) %>%
    plot_ly(x = ~Critic_Score) %>%
    add_histogram(marker = list(color = "rgb(17, 30, 108)"))


# Set the plotting symbol to diamond and the size to 4
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score)) %>%
    plot_ly(x = ~User_Score, y = ~Critic_Score) %>% 
    add_markers(marker = list(symbol="diamond", size=4))


# Use color to add Genre as a third variable
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score), !is.na(Genre)) %>%
    plot_ly(x=~Critic_Score, y=~User_Score, color=~Genre) %>%
    add_markers(colors="Dark2")


# Create a scatterplot of User_Score against Critic_Score coded by Rating
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score), !is.na(Rating)) %>%
    plot_ly(x=~Critic_Score, y=~User_Score, symbol=~Rating) %>%
    add_markers()


# Create a scatterplot of User_Score vs. Critic_Score colored by User_Count
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score), !is.na(User_Count)) %>%
    plot_ly(x = ~Critic_Score, y = ~User_Score, color=~User_Count) %>%
    add_markers()

# Create a scatterplot of User_Score vs. Critic_Score colored by log User_Count
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score), !is.na(User_Count)) %>%
    plot_ly(x = ~Critic_Score, y = ~User_Score, color=~log(User_Count)) %>%
    add_markers()


# Create a bar chart of Platform with hoverinfo only for the bar heights
vgsales2016 %>%
    filter(!is.na(Platform)) %>%
    count(Platform) %>%
    plot_ly(x=~Platform, y=~n, hoverinfo="y") %>%
    add_bars()


# Create a scatterplot of User_Score vs. Critic score
vgsales2016 %>%
    filter(!is.na(Critic_Score), !is.na(User_Score), !is.na(Name)) %>%
    # Add video game Name to the hover info text
    plot_ly(x=~Critic_Score, y=~User_Score, hoverinfo="text", text=~Name) %>% 
    add_markers()


# Format the hover info for NA_Sales, EU_Sales, and Name
vgsales2016 %>% 
    filter(!is.na(NA_Sales), !is.na(EU_Sales), !is.na(Name)) %>%
    plot_ly(x = ~NA_Sales, y = ~EU_Sales, hoverinfo = "text",
            text = ~paste("NA_Sales:", NA_Sales, "<br>", "EU_Sales:", EU_Sales, "<br>", "Name:", Name)
            ) %>%
    add_markers()


# Polish the scatterplot by transforming the x-axis and labeling both axes
vgsales2016 %>%
    filter(!is.na(Global_Sales), !is.na(Critic_Score)) %>%
    plot_ly(x = ~Global_Sales, y = ~Critic_Score) %>%
    add_markers(marker = list(opacity = 0.5)) %>%
    layout(xaxis = list(title="Global sales (millions of units)", type="log"),
           yaxis = list(title="Critic score")
           )


# Set the background color to #ebebeb and remove the vertical grid
vgsales %>%
    filter(!is.na(Year)) %>%
    group_by(Year) %>%
    summarize(Global_Sales = sum(Global_Sales, na.rm=TRUE)) %>%
    plot_ly(x = ~Year, y = ~Global_Sales) %>%
    add_lines() %>%
    layout(xaxis=list(showgrid=FALSE), paper_bgcolor="#ebebeb")

Chapter 3 - Advanced Charts

Layering Traces:

  • Layering traces allows for creating more complex charts - goal is the simplest chart that communicates the message effectively
  • Example for adding a loess smooth to a scatterplot - need to calculate the smooth, then pass it using add_lines with a y aesthetic
    • m <- loess(Alcohol ~ Flavanoids, data = wine, span = 1.5)
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol) %>% add_markers() %>% add_lines(y = ~fitted(m)) %>% layout(showlegend = FALSE)
    • m2 <- lm(Alcohol ~ poly(Flavanoids, 2), data = wine)
    • wine %>% plot_ly(x = ~Flavanoids, y = ~Alcohol) %>% add_markers(showlegend = FALSE) %>% add_lines(y = ~fitted(m), name = "LOESS") %>% add_lines(y = ~fitted(m2), name = "Polynomial")
  • Can also use layers to compare densities and distributions
    • d1 <- filter(wine, Type == 1)
    • d2 <- filter(wine, Type == 2)
    • d3 <- filter(wine, Type == 3)
    • density1 <- density(d1$Flavanoids)
    • density2 <- density(d2$Flavanoids)
    • density3 <- density(d3$Flavanoids)
    • plot_ly(opacity = 0.5) %>%
    • add_lines(x = ~density1$x, y = ~density1$y, name = "Type 1") %>%
    • add_lines(x = ~density2$x, y = ~density2$y, name = "Type 2") %>%
    • add_lines(x = ~density3$x, y = ~density3$y, name = "Type 3") %>%
    • layout(xaxis = list(title = 'Flavanoids'), yaxis = list(title = 'Density'))
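Worth noting that density() returns plain paired x/y coordinates (512 points by default), which is why its output can be passed straight to add_lines(); a quick base-R check:

```r
# density() output is just paired coordinates, ready for a line trace
set.seed(42)
d <- density(rnorm(100))
length(d$x)                          # 512 points by default
identical(length(d$x), length(d$y))  # TRUE
```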

Subplots:

  • A series of subplots can be a powerful way to explore the interactions of various variables - example of base code that would be better with subplots
    • vgsales2016 %>% plot_ly(x = ~Critic_Score, y = ~User_Score, color = ~Genre) %>% add_markers() # too many colors, hard to read
  • Example of creating a single subplot
    • action_df <- vgsales2016 %>% filter(Genre == "Action")
    • action_df %>% plot_ly(x = ~Critic_Score, y = ~User_Score) %>% add_markers()
  • Example of creating two subplots
    • p1 <- action_df %>% plot_ly(x = ~Critic_Score, y = ~User_Score) %>% add_markers()
    • p2 <- vgsales2016 %>% filter(Genre == "Adventure") %>% plot_ly(x = ~Critic_Score, y = ~User_Score) %>% add_markers()
    • subplot(p1, p2, nrows = 1)
  • Can add legends to the individuals subplots, which are then used by the subplot() command
    • p1 <- action_df %>% plot_ly(x = ~Critic_Score, y = ~User_Score) %>% add_markers(name = ~Genre)
    • p2 <- vgsales2016 %>% filter(Genre == "Adventure") %>% plot_ly(x = ~Critic_Score, y = ~User_Score) %>% add_markers(name = ~Genre)
    • subplot(p1, p2, nrows = 1)
  • Can create shared axis labels
    • subplot(p1, p2, nrows = 1, shareY = TRUE, shareX = TRUE)
  • Can also create facets more automatically using group_by() and do()
    • library(dplyr)
    • vgsales2016 %>%
    • group_by(Platform) %>%
    • do(plot = plot_ly(data = ., x = ~Critic_Score, y = ~User_Score) %>%
    • add_markers(name = ~Platform)) %>%
    • subplot(nrows = 2)

Scatterplot Matrices:

  • Can explore the pairwise relationship (pairs plot) for some or all of the variables in the dataset
    • data %>% plot_ly() %>% add_trace( type = 'splom', dimensions = list( list(label='string-1', values=X1), list(label='string-2', values=X2), . . . list(label='string-n', values=Xn)) )
    • Need to pass that we want "splom" (scatterplot matrix), then give each of the variables and their names
    • wine %>% plot_ly() %>% add_trace( type = 'splom', dimensions = list( list(label='Alcohol', values=~Alcohol), list(label='Flavonoids', values=~Flavanoids), list(label='Color', values=~Color) ) )
  • There is linked brushing in plotly, so changes in one panel will carry through to other panels
  • Can also add color to the scatterplot matrices
    • wine %>% plot_ly(color = ~Type) %>% add_trace( type = 'splom', dimensions = list( list(label='Alcohol', values=~Alcohol), list(label='Flavonoids', values=~Flavanoids), list(label='Color', values=~Color) ) )

Binned Scatterplots:

  • With large datasets, some charts (e.g., scatterplots) will perform poorly
  • Binned scatterplots can be a solution to the overplotting problem of a large scatterplot
    • sim_data %>% plot_ly(x = ~x, y = ~y) %>% add_histogram2d()
    • sim_data %>% plot_ly(x = ~x, y = ~y) %>% add_histogram2d(nbinsx = 200, nbinsy = 100)

Example code includes:

vgsales2016 <- vgsales %>%
    mutate(User_Score = as.numeric(User_Score)) %>%
    filter(Year == 2016, !is.na(User_Score), !is.na(Critic_Score))
str(vgsales2016)


# Fit the regression model of User_Score on Critic_Score
m <- lm(User_Score ~ Critic_Score, data = vgsales2016)

# Create the scatterplot with smoother
vgsales2016 %>%
   select(User_Score, Critic_Score) %>%
   na.omit() %>%
   plot_ly(x = ~Critic_Score, y = ~User_Score) %>%
   add_markers(showlegend = FALSE) %>%
   add_lines(y = ~fitted(m))


activision <- vgsales2016 %>% filter(Publisher == "Activision")
ea <- vgsales2016 %>% filter(Publisher == "Electronic Arts")
nintendo <- vgsales2016 %>% filter(Publisher == "Nintendo")

# Compute density curves
d.a <- density(activision$Critic_Score, na.rm = TRUE)
d.e <- density(ea$Critic_Score, na.rm = TRUE)
d.n <- density(nintendo$Critic_Score, na.rm = TRUE)

# Overlay density plots
plot_ly() %>%
  add_lines(x = ~d.a$x, y = ~d.a$y, name = "Activision", fill = 'tozeroy') %>%
  add_lines(x = ~d.e$x, y = ~d.e$y, name = "Electronic Arts", fill = 'tozeroy') %>%
  add_lines(x = ~d.n$x, y = ~d.n$y, name = "Nintendo", fill = 'tozeroy') %>%
  layout(xaxis = list(title = 'Critic Score'),
         yaxis = list(title = 'Density'))


# Create a scatterplot of User_Score against Critic_Score for PS4 games
p1 <- vgsales2016 %>%
   filter(Platform == "PS4") %>%
   plot_ly(x = ~Critic_Score, y = ~User_Score) %>% 
   add_markers(name = "PS4")

# Create a scatterplot of User_Score against Critic_Score for XOne games
p2 <- vgsales2016 %>%
   filter(Platform == "XOne") %>%
   plot_ly(x = ~Critic_Score, y = ~User_Score) %>% 
   add_markers(name = "XOne")

# Create a faceted scatterplot containing p1 and p2
subplot(p1, p2, nrows=2)


# Create a faceted scatterplot of User_Score vs. Critic_Score with 3 rows
vgsales2016 %>%
  group_by(Platform) %>%
  do(plot = plot_ly(data = ., x=~Critic_Score, y=~User_Score) %>% 
         add_markers(name = ~Platform)
     ) %>%
  subplot(nrows = 3, shareY = TRUE, shareX = TRUE)


# Add x-axis and y-axis labels, and a title
sp2 <- subplot(p1, p2, nrows = 2, shareX=TRUE, shareY=TRUE) %>%
    layout(title="User score vs. critic score by platform, 2016")
sp2

# Add x-axis and y-axis labels, and a title to  sp2
sp2 %>%
   layout(xaxis = list(title=""), xaxis2 = list(title="Year"), 
          yaxis = list(title="Global Sales (M units)"), yaxis2 = list(title="Global Sales (M units)")
          )


# Create a SPLOM of NA_Sales, EU_Sales, and JP_Sales
vgsales2016 %>%
  plot_ly() %>%
  add_trace(type = "splom", dimensions = list(list(label = "N. America", values = ~NA_Sales),
                                              list(label = "Europe", values = ~EU_Sales),
                                              list(label = "Japan", values = ~JP_Sales)
                                              )
            )


# Color the SPLOM of NA_Sales, EU_Sales, and JP_Sales by nintendo
vgsales2016 %>%
  mutate(nintendo = ifelse(Publisher == "Nintendo", "Nintendo", "Other")) %>%
  plot_ly(color=~nintendo) %>% 
  add_trace(type="splom", dimensions = list(list(label = "N. America", values = ~NA_Sales),
                                            list(label = "Europe", values = ~EU_Sales),
                                            list(label = "Japan", values = ~JP_Sales)
                                            )
            )


# Delete the diagonal plots in splom
splom %>%
   style(diagonal = list(visible=FALSE))

# Delete the plots in the upper half of splom
splom %>%
   style(showupperhalf=FALSE)

# Delete the plots in the lower half of splom
splom %>%
   style(showlowerhalf=FALSE)


# Create a binned scatterplot of User_Score vs. Critic_Score
vgsales %>%
  plot_ly(x=~Critic_Score, y=~User_Score) %>%
  add_histogram2d(nbinsx=50, nbinsy=50)

Chapter 4 - Case Study

Introduction to 2018 Election Data:

  • The 2018 US midterm election featured 435 House seats, 35 Senate seats, and 36 governorships
  • Can use turnout data and fundraising data from the US Elections Project and the US FEC, respectively
    • glimpse(turnout)
    • glimpse(fundraising)
  • Can also see results of key races using
    • glimpse(senate_winners)

Choropleth Maps:

  • The choropleth can be used to show key variables by geography
    • turnout %>%
    • plot_geo(locationmode = 'USA-states') %>%
    • add_trace(z = ~turnout, locations = ~state.abbr) %>%
    • layout(geo = list(scope = 'usa')) # restricts map only to USA
  • Not all world regions are available in plot_geo - currently just locationmode: "USA-states" | "ISO-3" | "country names"
  • Mapping options can be passed to the geo-layout
    • scope = "world" | "usa" | "europe" | "asia" | "africa" | "north america" | "south america"
    • projection = list(type = "mercator") # other types include "conic conformal" | "robinson" | "stereographic" | and 18 more
    • scale = 1 # larger values = tighter zoom
    • center = list(lat = ~c.lat, lon = ~c.lon) # set c.lat and c.lon to center the map

From Polygons to Maps:

  • Since not all regions are available, may need to custom-create a choropleth
  • Need to start with a boundaries dataset, such as might be available in “us_states”
    • head(us_states)
  • Can join data, after cleaning up potential merging issues
    • turnout <- turnout %>% mutate(state = tolower(state)) # make state names lowercase
    • states_map <- left_join(us_states, turnout, by = c("region" = "state"))
  • Can then plot the data
    • states_map %>%
    • group_by(group) %>%
    • plot_ly( x=~long, y=~lat, color=~turnout2018, split=~region ) %>%
    • add_polygons(line = list(width = 0.4), showlegend = FALSE)
  • May want to further clean up the map - axes, ticks, etc.
    • state_turnout_map %>% layout( title = "2018 Voter Turnout by State", xaxis = list(title = "", showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(title = "", showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE) ) %>% colorbar(title = "Turnout")
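The tolower() cleanup matters because joins match keys exactly; a toy base-R sketch of the merge step (hypothetical mini data frames standing in for us_states and turnout):

```r
# Keys must match exactly: lowercasing both sides avoids silent join misses
boundaries <- data.frame(region = c("alabama", "alaska"))
turnout_toy <- data.frame(state = c("Alabama", "Alaska"),
                          turnout2018 = c(0.47, 0.54))
turnout_toy$state <- tolower(turnout_toy$state)
states_map <- merge(boundaries, turnout_toy, by.x = "region", by.y = "state")
nrow(states_map)  # 2 - without tolower() this would be 0
```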

Wrap Up:

  • Chapter 1 - Basics of Plotly and conversion of ggplot2 graphics - univariate, bivariate distributions
  • Chapter 2 - Customizing Plotly Charts - reduce over-plotting, change plotting symbols, adding color, hovering, custom layouts
  • Chapter 3 - Advanced Charts - layering traces, subplots, scatterplot matrices, binned scatterplots
  • Chapter 4 - Mapping Data - choropleths, polygons, customizing map appearances

Example code includes:

turnout <- readr::read_csv("./RInputFiles/TurnoutRates.csv")
str(turnout)


# Create a scatterplot of turnout2018 against turnout2014
p <- turnout %>%
    plot_ly(x=~turnout2014, y=~turnout2018) %>%
    add_markers() %>%
    layout(xaxis = list(title="2014 voter turnout"),
           yaxis = list(title="2018 voter turnout")
           )

p


# Add the line y = x to the scatterplot
p %>%
  add_lines(x = c(0.25, 0.6), y = c(0.25, 0.6)) %>%
  layout(showlegend=FALSE)


# Create a dotplot of voter turnout in 2018 by state ordered by turnout
turnout %>%
    top_n(15, wt = turnout2018) %>%
    plot_ly(x = ~turnout2018, y = ~fct_reorder(state, turnout2018)) %>%
    add_markers() %>%
    layout(yaxis=list(title="State", type="category"), xaxis=list(title="Eligible voter turnout"))


fundraising <- readr::read_csv("./RInputFiles/fec_candidate_summary_2018.csv")
str(fundraising)


# Create a histogram of receipts for the senate races
fundraising %>%
    filter(office=="S") %>%
    plot_ly(x=~receipts) %>%
    add_histogram() %>%
    layout(title="Fundraising for 2018 Senate races", xaxis=list(title="Total contributions received"))


# Create a dotplot of the top 15 Senate campaigns
fundraising %>%
    filter(office == "S") %>%
    top_n(15, wt = receipts) %>%
    plot_ly(x = ~receipts, y = ~fct_reorder(state, receipts), color = ~fct_drop(party), 
            hoverinfo = "text", text = ~paste("Candidate:", name, "<br>", "Party:", party, "<br>",
                                              "Receipts:", receipts, "<br>",
                                              "Disbursements:", disbursement
                                              )
            ) %>%
    add_markers(colors = c("blue", "red")) 


# Create a choropleth map of the change in voter turnout from 2014 to 2018
turnout %>%
    mutate(change = turnout2018 - turnout2014) %>%
    plot_geo(locationmode = 'USA-states') %>%
    add_trace(z=~change, locations=~state.abbr) %>%
    layout(geo = list(scope="usa"))


senate_winners <- readr::read_csv("./RInputFiles/senate_winners.csv")
str(senate_winners)


# Create a choropleth map displaying the Senate results
senate_winners %>%
    plot_geo(locationmode = "USA-states") %>%
    add_trace(z=~as.numeric(as.factor(party)), locations=~state, 
              colors = c("dodgerblue", "mediumseagreen", "tomato"),
              hoverinfo = "text", text = ~paste("Candidate:", name, "<br>",
                                                "Party:", party, "<br>",
                                                "% vote:", round(pct.vote, 1)
                                                )
              ) %>%
    layout(geo = list(scope = 'usa')) %>% 
    hide_colorbar()


# Map President Trump's rallies in 2018
# rallies2018 %>%
#     plot_geo(locationmode = 'USA-states') %>%
#     add_markers(x=~long, y=~lat, size=~no.speakers, 
#                 hoverinfo = "text", text = ~paste(city, state, sep = ",")
#                 ) %>%
#     layout(title = "2018 Trump Rallies", geo = list(scope = "usa"))


# Customize the geo layout
g <- list(scope = 'usa', 
          showland = TRUE, landcolor = "gray90",
          showlakes = TRUE, lakecolor = "white",
          showsubunit = TRUE, subunitcolor = "white"
          )

# Apply the geo layout to the map
# rallies2018 %>%
#     plot_geo(locationmode = 'USA-states') %>%
#     add_markers(x = ~long, y = ~lat, size = ~no.speakers, 
#                 hoverinfo = "text", text = ~paste(city, state, sep = ",")
#                 ) %>%
#     layout(title = "2018 Trump Rallies", geo = list(scope="usa"))


# Customize the geo layout
g <- list(scope = 'usa', 
          showland = TRUE, landcolor = toRGB("gray90"),
          showlakes = TRUE, lakecolor = toRGB("white"),
          showsubunit = TRUE, subunitcolor = toRGB("white")
          )

# Apply the geo layout to the map
# rallies2018 %>%
#     plot_geo(locationmode = 'USA-states') %>%
#     add_markers(x = ~long, y = ~lat, size = ~no.speakers, 
#                 hoverinfo = "text", text = ~paste(city, state, sep = ",")
#                 ) %>%
#     layout(title = "2018 Trump Rallies", geo = g)


fl_boundaries <- readr::read_csv("./RInputFiles/fl_boundaries.csv")
str(fl_boundaries)
fl_results <- readr::read_csv("./RInputFiles/fl_results.csv")
str(fl_results)


# Create a choropleth map displaying the Senate winners
# senate_vote %>%
#     group_by(group) %>%
#     plot_ly(x=~long, y=~lat, color=~PartyCode, split=~region) %>%
#     add_polygons(line = list(width=0.4), showlegend=FALSE)

# Adjust the polygon colors and boundaries
# senate_map %>%
#     group_by(group) %>%
#     plot_ly(x = ~long, y = ~lat, color = ~party, split = ~region, 
#             colors=c("dodgerblue", "mediumseagreen", "tomato")
#             ) %>%
#     add_polygons(line = list(width = 0.4, color=toRGB("gray60")), showlegend = FALSE)

# Define the layout settings to polish the axes
# map_axes <- list(title = "", showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)

# Apply the layout to both axes
# senate_map %>%
#     group_by(group) %>%
#     plot_ly(x = ~long, y = ~lat, color = ~party, split = ~region, 
#             colors = c("dodgerblue", "mediumseagreen", "tomato")
#             ) %>%
#     add_polygons(line = list(width = 0.4, color = toRGB("gray60")), showlegend = FALSE) %>%
#     layout(xaxis=map_axes, yaxis=map_axes)


# Join the fl_boundaries and fl_results data frames
senate_vote <- left_join(fl_boundaries, fl_results, by = c("subregion" = "CountyName"))

# Specify the axis settings to polish the map
map_axes <- list(title = "", showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)

# Create a polished county-level choropleth map of Pctvote
senate_vote %>%
    group_by(group) %>%
    plot_ly(x = ~long, y = ~lat, color = ~Pctvote, split = ~subregion) %>%
    add_polygons(line = list(width = 0.4), showlegend = FALSE, colors = c("blue", "red")) %>%
    layout(xaxis = map_axes, yaxis = map_axes)

Hyperparameter Tuning in R

Chapter 1 - Introduction to hyperparameters

Parameters vs. Hyperparameters:

  • The hyper-parameters differ from the model parameters
    • Model parameters are being fit during training
    • Hyper-parameters are set prior to training and specify how the training should happen
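To make the distinction concrete: in a simple model the coefficients are parameters estimated from the data, while something like the loess span is a hyperparameter chosen beforehand (base R, built-in cars data):

```r
# Parameters: estimated during fitting
fit <- lm(dist ~ speed, data = cars)
coef(fit)  # intercept and slope, learned from the data

# Hyperparameter: set by us before fitting, controls how training behaves
sm <- loess(dist ~ speed, data = cars, span = 0.75)
```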

Recap of machine learning basics:

  • The first step for hyper-parameter tuning is to split the data into test/train subsets
    • index <- caret::createDataPartition(breast_cancer_data$diagnosis, p = .70, list = FALSE)
    • bc_train_data <- breast_cancer_data[index, ]
    • bc_test_data <- breast_cancer_data[-index, ]
  • Can then train the model using caret
    • fitControl <- caret::trainControl(method = "repeatedcv", number = 3, repeats = 5)
    • tictoc::tic() # timing function
    • set.seed(42)
    • rf_model <- train(diagnosis ~ ., data = bc_train_data, method = "rf", trControl = fitControl, verbose = FALSE)
    • tictoc::toc() # timing function
  • The caret package automatically attempts a few hyper-parameters and picks the best for the final results
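A base-R stand-in for the createDataPartition() split above (simple random sampling rather than caret's stratified sampling):

```r
# 70/30 split of the built-in iris data
set.seed(42)
idx <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]
c(train = nrow(train), test = nrow(test))  # 105 45
```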

Hyperparameter tuning in caret:

  • The caret package automatically performs some hyper-parameter tuning (e.g., tries a few values for mtry)
  • Many models are available in caret, with details available at
  • Example for running an SVM using caret
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
    • tictoc::tic() # timing function
    • set.seed(42)
    • svm_model <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE)
    • tictoc::toc() # timing function
  • Can specify the number of values per hyperparameter to attempt for automatic tuning
    • svm_model_2 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, verbose = FALSE, tuneLength = 5) # tuneLength = 5 means "try 5 values of each hyperparameter"
  • Can also manually specify hyperparameters for tuning
    • hyperparams <- expand.grid(degree = 4, scale = 1, C = 1)
    • svm_model_3 <- train(diagnosis ~ ., data = bc_train_data, method = "svmPoly", trControl = fitControl, tuneGrid = hyperparams, verbose = FALSE)
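expand.grid() builds the full cross of candidate values, which is exactly the shape tuneGrid expects; for example:

```r
# Every combination of degree x scale x C: 2 * 2 * 1 = 4 candidate settings
hyperparams <- expand.grid(degree = c(2, 3), scale = c(0.1, 1), C = 1)
dim(hyperparams)  # 4 rows, 3 columns
```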

Example code includes:

breast_cancer_data <- readr::read_csv("./RInputFiles/breast_cancer_data.csv")
## Parsed with column specification:
## cols(
##   diagnosis = col_character(),
##   concavity_mean = col_double(),
##   symmetry_mean = col_double(),
##   fractal_dimension_mean = col_double(),
##   perimeter_se = col_double(),
##   smoothness_se = col_double(),
##   concavity_se = col_double(),
##   `concave points_se` = col_double(),
##   perimeter_worst = col_double(),
##   symmetry_worst = col_double(),
##   fractal_dimension_worst = col_double()
## )
str(breast_cancer_data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 100 obs. of  11 variables:
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   diagnosis = col_character(),
##   ..   concavity_mean = col_double(),
##   ..   symmetry_mean = col_double(),
##   ..   fractal_dimension_mean = col_double(),
##   ..   perimeter_se = col_double(),
##   ..   smoothness_se = col_double(),
##   ..   concavity_se = col_double(),
##   ..   `concave points_se` = col_double(),
##   ..   perimeter_worst = col_double(),
##   ..   symmetry_worst = col_double(),
##   ..   fractal_dimension_worst = col_double()
##   .. )
# bc_train_data <- readr::read_csv("./RInputFiles/bc_train_data.csv")
# str(bc_train_data)


# Fit a linear model on the breast_cancer_data.
linear_model <- lm(concavity_mean ~ symmetry_mean, data=breast_cancer_data)

# Look at the summary of the linear_model.
summary(linear_model)
## 
## Call:
## lm(formula = concavity_mean ~ symmetry_mean, data = breast_cancer_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.201877 -0.039201 -0.008432  0.030655  0.226150 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.15311    0.04086  -3.747 0.000303 ***
## symmetry_mean  1.33366    0.21257   6.274 9.57e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06412 on 98 degrees of freedom
## Multiple R-squared:  0.2866, Adjusted R-squared:  0.2793 
## F-statistic: 39.36 on 1 and 98 DF,  p-value: 9.575e-09
# Extract the coefficients.
coef(linear_model)
##   (Intercept) symmetry_mean 
##    -0.1531055     1.3336568
# Plot linear relationship.
ggplot(data = breast_cancer_data, aes(x = symmetry_mean, y = concavity_mean)) +
    geom_point(color = "grey") +
    geom_abline(slope = coef(linear_model)[2], intercept = coef(linear_model)[1])

# Create partition index
index <- caret::createDataPartition(breast_cancer_data$diagnosis, p = 0.7, list = FALSE)

# Subset `breast_cancer_data` with index
bc_train_data <- breast_cancer_data[index, ]
bc_test_data  <- breast_cancer_data[-index, ]

# Define 5-fold cross-validation, repeated 3 times
fitControl <- caret::trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Run the train() function
gbm_model <- caret::train(diagnosis ~ ., data = bc_train_data, method="gbm", 
                          trControl=fitControl, verbose = FALSE
                          )
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
# Look at the model
gbm_model
## Stochastic Gradient Boosting 
## 
## 70 samples
## 10 predictors
##  2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 56, 56, 56, 56, 56, 56, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8809524  0.7619048
##   1                  100      0.8857143  0.7714286
##   1                  150      0.8761905  0.7523810
##   2                   50      0.8809524  0.7619048
##   2                  100      0.8714286  0.7428571
##   2                  150      0.8761905  0.7523810
##   3                   50      0.8809524  0.7619048
##   3                  100      0.8809524  0.7619048
##   3                  150      0.8857143  0.7714286
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
set.seed(42)  # Set seed.
tictoc::tic()  # Start timer.
gbm_model <- caret::train(diagnosis ~ ., data = bc_train_data, method = "gbm", 
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
                          verbose = FALSE, tuneLength=4
                          )
tictoc::toc()  # Stop timer.
## 2.54 sec elapsed
# Define hyperparameter grid.
hyperparams <- expand.grid(n.trees = 200, interaction.depth = 1, 
                           shrinkage = 0.1, n.minobsinnode = 10
                           )

# Apply hyperparameter grid to train().
set.seed(42)
gbm_model <- caret::train(diagnosis ~ ., data = bc_train_data, method = "gbm", 
                          trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
                          verbose = FALSE, tuneGrid=hyperparams
                          )

Chapter 2 - Hyperparameter Tuning with caret

Hyperparameter tuning in caret:

  • Example for using 2016 US voter turnout dataset
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
    • gbm_model_voters <- train(turnout16_2016 ~ ., data = voters_train_data, method = "gbm", trControl = fitControl, verbose = FALSE)
    • gbm_model_voters
  • Can use a Cartesian grid for hyperparameters - increases the model run-time
    • man_grid <- expand.grid(n.trees = c(100, 200, 250), interaction.depth = c(1, 4, 6), shrinkage = 0.1, n.minobsinnode = 10)
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
    • gbm_model_voters_grid <- train(turnout16_2016 ~ ., data = voters_train_data, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = man_grid)
  • Can plot the hyper-parameter performance either on accuracy or Kappa
    • plot(gbm_model_voters_grid)
    • plot(gbm_model_voters_grid, metric = "Kappa", plotType = "level")

Grid vs. Random Search:

  • Can continue with the grid search methodology using ranges and seq()
    • big_grid <- expand.grid(n.trees = seq(from = 10, to = 300, by = 50), interaction.depth = seq(from = 1, to = 10, length.out = 6), shrinkage = 0.1, n.minobsinnode = 10)
  • Can instead run random search, looking for good parameters randomly
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random")
    • gbm_model_voters_random <- train(turnout16_2016 ~ ., data = voters_train_data, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 5) # tuneLength would likely need to be ~100
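Grid cost grows multiplicatively with each hyperparameter, which is worth checking before training. A small base-R sketch (rebuilding the big_grid from the note above) makes the combinatorics concrete:

```r
# Each hyperparameter multiplies the number of model fits required.
big_grid <- expand.grid(n.trees = seq(from = 10, to = 300, by = 50),          # 6 values
                        interaction.depth = seq(from = 1, to = 10, length.out = 6),  # 6 values
                        shrinkage = 0.1,        # held constant
                        n.minobsinnode = 10)    # held constant

nrow(big_grid)  # 6 x 6 = 36 combinations, each fit 3 x 5 = 15 times under repeated CV
```

Random search with a modest tuneLength sidesteps this explosion by sampling the hyperparameter space instead of enumerating it.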

Adaptive Resampling:

  • Adaptive resampling is a blend between grid sampling and random sampling
    • Hyperparameter combinations are resampled with values near combinations that performed well
    • Adaptive resampling can therefore be faster and more efficient than an exhaustive grid search
  • Can implement adaptive resampling using caret
    • min: minimum number of resamples completed before hyperparameter combinations can be removed
    • alpha: confidence level used when removing hyperparameter combinations
    • method: "gls" for a linear model or "BT" for Bradley-Terry (better for a large number of hyper-parameters and/or models that are already close to optimal)
    • complete: if TRUE generates full resampling set
    • fitControl <- trainControl(method = "adaptive_cv", adaptive = list(min = 2, alpha = 0.05, method = "gls", complete = TRUE), search = "random")
    • fitControl <- trainControl(method = "adaptive_cv", number = 3, repeats = 3, adaptive = list(min = 2, alpha = 0.05, method = "gls", complete = TRUE), search = "random")
    • gbm_model_voters_adaptive <- train(turnout16_2016 ~ ., data = voters_train_data, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 7) # tuneLength would generally be at least ~100
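Putting the adaptive-resampling pieces together, a runnable sketch (assuming caret is loaded and voters_train_data exists as built in the example code below; the small min and tuneLength values are illustrative only):

```r
# Adaptive CV: poorly performing hyperparameter combinations are dropped
# after `min` resamples, so later resamples concentrate on promising ones.
fitControl <- caret::trainControl(method = "adaptive_cv", number = 3, repeats = 3,
                                  adaptive = list(min = 2, alpha = 0.05,
                                                  method = "gls", complete = TRUE),
                                  search = "random")

set.seed(42)
gbm_model_voters_adaptive <- caret::train(turnout16_2016 ~ ., data = voters_train_data,
                                          method = "gbm", trControl = fitControl,
                                          verbose = FALSE, tuneLength = 7)
```

With `complete = TRUE`, caret computes the full resampling set for the winning combination, so the final performance estimates remain comparable to a standard repeated-CV run.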

Example code includes:

tgtData <- rep(c("Did not vote", "Voted"), each=40)
vecVoteData <- c(2, 2, 3, 2, 2, 3, 3, 1, 2, 3, 4, 4, 4, 3, 1, 2, 2, 2, 3, 2, 1, 2, 3, 2, 1, 3, 3, 3, 3, 4, 2, 4, 1, 4, 3, 3, 2, 4, 2, 1, 3, 2, 2, 1, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 2, 3, 4, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 4, 2, 3, 3, 3, 3, 3, 2, 4, 1, 3, 4, 3, 3, 2, 2, 3, 3, 2, 2, 1, 2, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 4, 1, 2, 2, 3, 2, 1, 3, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 3, 4, 1, 4, 4, 1, 3, 4, 4, 2, 2, 3, 3, 3, 2, 3, 1, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 4, 2, 2, 2, 1, 4, 2, 2, 3, 3, 4, 2, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 3, 1, 3, 2, 1, 1, 3, 2, 2, 1, 2, 2, 1, 1, 1, 3, 2, 2, 1, 1, 2, 1, 3, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 4, 2, 1, 4, 2, 2, 3, 2, 3, 2, 3, 1, 3, 2, 1, 2, 2, 3, 2, 1, 1, 1, 3, 2, 1, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 3, 3, 1, 3, 3, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 3, 1, 1, 1, 2, 2, 2, 1, 2, 2, 4, 3, 3, 4, 1, 4, 4, 1, 2, 1, 3, 4, 4, 2, 3, 1, 3, 1, 3, 1, 1, 2, 3, 1, 2, 1, 3, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 4, 1, 1, 3, 1, 2, 2, 2, 2, 3, 1, 1, 2, 3, 2, 2, 1, 3, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 4, 3, 3, 2, 1, 1, 2, 3, 3, 1, 2, 3, 2, 2, 3, 2, 3, 3, 1, 1, 2, 2, 2, 2, 2, 3, 1, 3, 2, 2, 3, 4, 3, 2, 3, 3, 2, 3, 2, 3, 2, 2, 1, 4, 1, 3, 3, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 3, 1, 3, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 3, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 4, 1, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 4, 2, 2, 1, 2, 2, 3, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 1, 1, 2, 3, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 3, 4, 2, 3, 2, 1, 3, 2, 2, 2, 3, 2, 3, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 3, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 2, 2, 2, 2, 3, 1, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2)
vecVoteData <- c(vecVoteData, 2, 3, 1, 2, 2, 2, 3, 2, 4, 1, 1, 2, 2, 2, 3, 4, 3, 4, 2, 2, 3, 2, 2, 4, 1, 1, 3, 4, 2, 4, 3, 3, 2, 4, 3, 3, 2, 2, 1, 3, 3, 1, 2, 2, 1, 1, 1, 1, 3, 2, 2, 1, 2, 2, 3, 3, 2, 1, 3, 2, 3, 3, 1, 3, 1, 2, 1, 2, 3, 3, 2, 3, 3, 2, 4, 3, 1, 2, 2, 3, 1, 1, 3, 3, 2, 2, 1, 2, 3, 1, 1, 2, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 1, 3, 1, 1, 3, 3, 2, 4, 3, 3, 3, 3, 3, 3, 4, 1, 1, 1, 1, 3, 3, 1, 2, 2, 3, 1, 3, 3, 4, 2, 3, 1, 3, 2, 1, 1, 3, 3, 3, 3, 3, 2, 3, 3, 1, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 2, 2, 3, 2, 4, 1, 1, 3, 3, 3, 3, 4, 3, 2, 3, 3, 1, 3, 1, 4, 1, 1, 4, 4, 4, 4, 3, 3, 4, 3, 2, 3, 4, 2, 4, 1, 1, 3, 4, 3, 1, 1, 3, 4, 4, 4, 1, 1, 3, 1, 3, 3, 1, 3, 1, 3, 3, 3, 4, 4, 3, 3, 4, 2, 2, 3, 2, 3, 4, 2, 3, 4, 3, 3, 2, 2, 1, 2, 2, 8, 2, 8, 8, 2, 2, 2, 2, 2, 1, 2, 2, 8, 8, 8, 2, 1, 2, 2, 2, 1, 2, 1, 8, 8, 2, 2, 8, 8, 1, 2, 2, 2, 8, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 8, 2, 1, 1, 1, 1, 2, 1, 1, 2, 8, 2, 1, 1, 2, 1, 2, 1, 3, 8, 3, 3, 1, 3, 2, 2, 3, 2, 1, 3, 3, 3, 3, 3, 3, 8, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 8, 2, 3, 8, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 1, 3, 1, 8, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 2, 2, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 1, 8, 8, 3, 1, 3, 3, 3, 3, 8, 3, 3, 3, 3, 8, 3, 3, 8, 3, 1, 3, 3, 3, 2, 3, 8, 3, 3, 3, 3, 8, 2, 2, 3, 1, 1, 2, 2, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 8, 2, 1, 1, 1, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 8, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 8, 2)
vecVoteData <- c(vecVoteData, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 8, 2, 1, 2, 2, 1, 2, 2, 2, 8, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 8, 8, 2, 1, 1, 8, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 8, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 8, 2, 1, 1, 2, 2, 1, 1, 8, 2, 1, 1, 8, 1, 1, 1, 2, 8, 1, 1, 8, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 3, 1, 1, 2, 2, 1, 2, 2, 3, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 1, 2, 1, 4, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 3, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 4, 2, 1, 1, 2, 1, 1, 3, 2, 4, 2, 2, 1, 1, 1, 2, 1, 1, 2, 4, 1, 2, 2, 3, 1, 2, 2, 4, 3, 2, 1, 1, 1, 3, 1, 4, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 4, 1, 3, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 3, 2, 1, 1, 1, 1, 2, 4, 2, 2, 2, 1, 2, 1, 3, 1, 2, 2, 3, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 1, 2, 3, 2, 1, 3, 1, 2, 2, 1, 3, 2, 2, 1, 2, 3, 4, 1, 1, 2, 3, 1, 1, 1, 2, 3, 4, 2, 1, 4, 2, 2, 2, 2, 1, 2, 3, 1, 1, 2, 1, 1, 4, 1, 1, 4, 1, 4, 3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 2, 2, 1, 2, 2, 3, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 4, 1, 1, 2, 1, 2, 1, 1, 1, 4, 1, 1, 2, 2, 2, 3, 3, 2, 2, 1, 2, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 3, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 3, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 4, 1, 2, 1, 1, 1, 2)
vecVoteData <- c(vecVoteData, 1, 1, 1, 2, 3, 2, 1, 1, 2, 3, 4, 1, 1, 2, 2, 2, 4, 2, 2, 4, 3, 3, 2, 4, 4, 3, 3, 4, 2, 2, 4, 4, 4, 2, 3, 1, 4, 2, 4, 4, 3, 3, 1, 4, 3, 3, 3, 1, 3, 2, 1, 2, 1, 1, 3, 2, 4, 2, 3, 2, 2, 1, 4, 3, 3, 3, 1, 2, 3, 3, 1, 3, 4, 4, 4, 3, 3, 4, 1, 1, 2, 4, 1, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 3, 1, 2, 2, 1, 3, 2, 1, 1, 1, 1, 3, 1, 1, 1, 3, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 3, 2, 1, 2, 2, 1, 2, 1, 2, 1, 3, 1, 2, 2, 4, 1, 1, 1, 2, 1, 1, 1, 1, 3, 2, 3, 2, 2, 1, 2, 1, 3, 2, 1, 3, 1, 1, 4, 2, 2, 1, 3, 2, 2, 3, 2, 1, 2, 2, 2, 3, 2, 2, 1, 1, 4, 1, 1, 2, 4, 2, 3, 2, 2, 2, 4, 4, 2, 1, 1, 2, 1, 2, 2, 2, 3, 3, 3, 4, 2, 4, 3, 1, 4, 3, 3, 2, 2, 2, 2, 3, 1, 2, 2, 4, 1, 1, 3, 4, 1, 1, 1, 3, 2, 4, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 4, 1, 2, 1, 4, 1, 3, 1, 3, 1, 1, 1, 3, 2, 2, 2, 1, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 4, 2, 1, 2, 2, 4, 2, 1, 3, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 3, 4, 2, 1, 1, 3, 2, 2, 1, 1, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 4, 2, 1, 1, 4, 3, 1, 3, 1, 1, 2, 4, 3, 1, 4, 1, 2, 1, 3, 1, 2, 2, 4, 1, 2, 3, 1, 4, 1, 2, 4, 1, 1, 1, 1, 1, 1, 2, 3, 1, 4, 1, 4, 4, 2, 1, 4, 1, 4, 2, 2, 4, 2, 3, 1, 4, 3, 4, 3, 1, 3, 4, 1, 1, 1, 3, 1, 1, 2, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 4, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 4, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 3, 1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 1, 2, 2, 4, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 3, 2, 2, 1, 2, 1, 2, 2, 3, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1)
vecVoteData <- c(vecVoteData, 1, 1, 2, 1, 4, 2, 2, 2, 2, 1, 2, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 2, 1, 2, 3, 1, 3, 1, 4, 1, 3, 3, 3, 2, 2, 2, 3, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 4, 1, 2, 2, 4, 3, 1, 3, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 4, 2, 2, 2, 3, 1, 2, 2, 4, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 3, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 4, 1, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 4, 4, 3, 1, 4, 2, 2, 2, 3, 1, 1, 2, 1, 1, 4, 4, 1, 4, 3, 2, 1, 2, 4, 3, 3, 4, 1, 2, 1, 2, 2, 3, 3, 1, 4, 3, 3, 4, 3, 4, 1, 1, 3, 3, 1, 1, 3, 1, 1, 3, 1, 3, 3, 3, 3, 4, 4, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 4, 1, 1, 1, 2, 1, 4, 3, 4, 4, 2, 2, 2, 3, 1, 2, 1, 2, 2, 3, 3, 2, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 2, 1, 1, 2, 1, 1, 3, 3, 2, 2, 2, 2, 1, 3, 3, 4, 3, 3, 3, 3, 1, 3, 2, 3, 1, 1, 2, 2, 1, 3, 1, 2, 2, 1, 1, 1, 3, 1, 1, 1, 3, 4, 1, 1, 2, 3, 3, 2, 2, 2, 2, 1, 2, 1, 4, 3, 3, 1, 1, 2, 1, 1, 4, 1, 1, 4, 2, 3, 2, 3, 3, 3, 2, 2, 2, 3, 1, 1, 1, 1, 3, 1, 2, 2, 3, 4, 3, 1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 3, 3, 3, 4, 2, 1, 3, 2, 2, 1, 2, 2, 1, 3, 1, 2, 4, 4, 2, 1, 1, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 1, 1, 2, 2, 1, 3, 1, 1, 4, 3, 3, 2, 4, 3, 3, 2, 2, 1, 2, 3, 2, 2, 2, 3, 1, 2, 2, 3, 2, 3, 2, 1, 4, 2, 3, 1, 1, 1, 2, 1, 1, 2, 2, 4, 1, 3, 4, 3, 4, 2, 1, 4, 2, 3, 2, 1, 2, 2, 4, 1, 2, 4, 4, 3, 3, 3, 4, 1, 2, 1, 2)
voters_train_data <- tibble(turnout16_2016=tgtData) %>% 
    bind_cols(as.data.frame(matrix(vecVoteData, nrow=80, byrow=FALSE)))
names(voters_train_data) <- c('turnout16_2016', 'RIGGED_SYSTEM_1_2016', 'RIGGED_SYSTEM_2_2016', 'RIGGED_SYSTEM_3_2016', 'RIGGED_SYSTEM_4_2016', 'RIGGED_SYSTEM_5_2016', 'RIGGED_SYSTEM_6_2016', 'track_2016', 'persfinretro_2016', 'econtrend_2016', 'Americatrend_2016', 'futuretrend_2016', 'wealth_2016', 'values_culture_2016', 'US_respect_2016', 'trustgovt_2016', 'trust_people_2016', 'helpful_people_2016', 'fair_people_2016', 'imiss_a_2016', 'imiss_b_2016', 'imiss_c_2016', 'imiss_d_2016', 'imiss_e_2016', 'imiss_f_2016', 'imiss_g_2016', 'imiss_h_2016', 'imiss_i_2016', 'imiss_k_2016', 'imiss_l_2016', 'imiss_m_2016', 'imiss_n_2016', 'imiss_o_2016', 'imiss_p_2016', 'imiss_r_2016', 'imiss_s_2016', 'imiss_t_2016', 'imiss_u_2016', 'imiss_x_2016', 'imiss_y_2016')
glimpse(voters_train_data)
## Observations: 80
## Variables: 40
## $ turnout16_2016       <chr> "Did not vote", "Did not vote", "Did not ...
## $ RIGGED_SYSTEM_1_2016 <dbl> 2, 2, 3, 2, 2, 3, 3, 1, 2, 3, 4, 4, 4, 3,...
## $ RIGGED_SYSTEM_2_2016 <dbl> 3, 3, 2, 2, 3, 3, 2, 2, 1, 2, 4, 2, 3, 2,...
## $ RIGGED_SYSTEM_3_2016 <dbl> 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2,...
## $ RIGGED_SYSTEM_4_2016 <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 3, 3,...
## $ RIGGED_SYSTEM_5_2016 <dbl> 1, 2, 2, 2, 2, 3, 1, 1, 2, 3, 2, 2, 1, 3,...
## $ RIGGED_SYSTEM_6_2016 <dbl> 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 3, 1, 3,...
## $ track_2016           <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1,...
## $ persfinretro_2016    <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 2, 2, 2,...
## $ econtrend_2016       <dbl> 2, 2, 2, 3, 1, 2, 2, 2, 3, 2, 4, 1, 1, 2,...
## $ Americatrend_2016    <dbl> 2, 3, 1, 1, 3, 3, 2, 2, 1, 2, 3, 1, 1, 2,...
## $ futuretrend_2016     <dbl> 3, 3, 3, 4, 4, 3, 2, 2, 3, 2, 4, 1, 1, 3,...
## $ wealth_2016          <dbl> 2, 2, 1, 2, 2, 8, 2, 8, 8, 2, 2, 2, 2, 2,...
## $ values_culture_2016  <dbl> 3, 8, 3, 3, 1, 3, 2, 2, 3, 2, 1, 3, 3, 3,...
## $ US_respect_2016      <dbl> 3, 3, 3, 2, 3, 3, 2, 3, 1, 8, 8, 3, 1, 3,...
## $ trustgovt_2016       <dbl> 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3,...
## $ trust_people_2016    <dbl> 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1,...
## $ helpful_people_2016  <dbl> 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2,...
## $ fair_people_2016     <dbl> 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1,...
## $ imiss_a_2016         <dbl> 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1,...
## $ imiss_b_2016         <dbl> 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1,...
## $ imiss_c_2016         <dbl> 3, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1,...
## $ imiss_d_2016         <dbl> 2, 2, 1, 1, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1,...
## $ imiss_e_2016         <dbl> 3, 4, 2, 1, 4, 2, 2, 2, 2, 1, 2, 3, 1, 1,...
## $ imiss_f_2016         <dbl> 2, 2, 2, 3, 3, 2, 2, 1, 2, 1, 2, 2, 1, 1,...
## $ imiss_g_2016         <dbl> 1, 2, 3, 2, 1, 1, 2, 3, 4, 1, 1, 2, 2, 2,...
## $ imiss_h_2016         <dbl> 1, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1,...
## $ imiss_i_2016         <dbl> 3, 2, 3, 2, 2, 1, 2, 1, 3, 2, 1, 3, 1, 1,...
## $ imiss_k_2016         <dbl> 2, 4, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1,...
## $ imiss_l_2016         <dbl> 2, 2, 1, 1, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1,...
## $ imiss_m_2016         <dbl> 1, 1, 2, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1,...
## $ imiss_n_2016         <dbl> 2, 2, 1, 4, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1,...
## $ imiss_o_2016         <dbl> 2, 1, 2, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1,...
## $ imiss_p_2016         <dbl> 4, 2, 2, 2, 2, 1, 2, 1, 2, 1, 3, 1, 1, 1,...
## $ imiss_r_2016         <dbl> 3, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1,...
## $ imiss_s_2016         <dbl> 1, 1, 2, 1, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1,...
## $ imiss_t_2016         <dbl> 4, 4, 3, 1, 4, 2, 2, 2, 3, 1, 1, 2, 1, 1,...
## $ imiss_u_2016         <dbl> 4, 2, 2, 2, 3, 1, 2, 1, 2, 2, 3, 3, 2, 1,...
## $ imiss_x_2016         <dbl> 2, 2, 2, 1, 2, 1, 4, 3, 3, 1, 1, 2, 1, 1,...
## $ imiss_y_2016         <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 3, 1, 1, 2, 2, 1,...
# Define Cartesian grid
man_grid <- expand.grid(degree = c(1, 2, 3), scale = c(0.1, 0.01, 0.001), C = 0.5)

fitControl <- caret::trainControl(method = "repeatedcv", number = 3, repeats = 5)

# Start timer, set seed & train model
tictoc::tic()
set.seed(42)
svm_model_voters_grid <- caret::train(turnout16_2016 ~ ., data = voters_train_data, method = "svmPoly", 
                                      trControl = fitControl, verbose = FALSE, tuneGrid = man_grid
                                      )
tictoc::toc()
## 5.94 sec elapsed
# Plot default
plot(svm_model_voters_grid)

# Plot Kappa level-plot
plot(svm_model_voters_grid, metric = "Kappa", plotType = "level")

# Define the grid with hyperparameter ranges
big_grid <- expand.grid(size = seq(from = 1, to = 5, by = 1), decay = c(0, 1))

# Train control with grid search
fitControl <- caret::trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "grid")

# Train neural net
tictoc::tic()
set.seed(42)
nn_model_voters_big_grid <- caret::train(turnout16_2016 ~ ., data = voters_train_data, 
                                         method = "nnet", trControl = fitControl, verbose = FALSE
                                         )
## # weights:  42
## initial  value 37.543902 
## iter  10 value 31.027084
## iter  20 value 11.826996
## iter  30 value 11.383068
## iter  40 value 11.369768
## iter  50 value 11.368247
## iter  60 value 11.368065
## final  value 11.368020 
## converged
## # weights:  124
## initial  value 38.770476 
## iter  10 value 33.047592
## iter  20 value 33.024777
## iter  30 value 32.288188
## iter  40 value 31.181895
## iter  50 value 31.163093
## iter  60 value 30.251351
## iter  70 value 29.240243
## iter  80 value 29.237457
## iter  90 value 29.237339
## final  value 29.237329 
## converged
## # weights:  206
## initial  value 37.128261 
## iter  10 value 24.521037
## iter  20 value 9.651809
## iter  30 value 8.073319
## iter  40 value 8.054354
## iter  50 value 8.052508
## iter  60 value 8.052367
## iter  70 value 8.052246
## final  value 8.052241 
## converged
## # weights:  42
## initial  value 37.768117 
## iter  10 value 29.235462
## iter  20 value 15.314412
## iter  30 value 11.683048
## iter  40 value 11.631211
## final  value 11.631175 
## converged
## # weights:  124
## initial  value 37.897167 
## iter  10 value 22.263986
## iter  20 value 14.958578
## iter  30 value 12.341855
## iter  40 value 11.615990
## iter  50 value 10.287985
## iter  60 value 9.817803
## iter  70 value 9.445922
## iter  80 value 9.250218
## iter  90 value 9.151521
## iter 100 value 9.149065
## final  value 9.149065 
## stopped after 100 iterations
## # weights:  206
## initial  value 44.230229 
## iter  10 value 22.896820
## iter  20 value 13.420811
## iter  30 value 10.602282
## iter  40 value 9.648477
## iter  50 value 9.206581
## iter  60 value 9.107379
## iter  70 value 8.930251
## iter  80 value 8.762096
## iter  90 value 8.745108
## iter 100 value 8.733854
## final  value 8.733854 
## stopped after 100 iterations
## # weights:  42
## initial  value 38.599223 
## iter  10 value 36.013733
## iter  20 value 28.650112
## iter  30 value 25.626085
## iter  40 value 25.102655
## iter  50 value 25.099438
## iter  60 value 25.096932
## iter  70 value 25.093795
## iter  80 value 20.499773
## iter  90 value 11.863452
## iter 100 value 8.588471
## final  value 8.588471 
## stopped after 100 iterations
## # weights:  124
## initial  value 37.369941 
## iter  10 value 15.908077
## iter  20 value 11.939921
## iter  30 value 11.443831
## iter  40 value 10.998310
## iter  50 value 8.618822
## iter  60 value 8.585574
## iter  70 value 8.580485
## iter  80 value 8.573734
## iter  90 value 8.568160
## iter 100 value 8.563467
## final  value 8.563467 
## stopped after 100 iterations
## # weights:  206
## initial  value 40.450951 
## iter  10 value 30.381579
## iter  20 value 14.285982
## iter  30 value 14.254171
## iter  40 value 14.246870
## iter  50 value 11.470487
## iter  60 value 11.389585
## iter  70 value 11.077604
## iter  80 value 6.484972
## iter  90 value 6.052391
## iter 100 value 4.980032
## final  value 4.980032 
## stopped after 100 iterations
## # weights:  42
## initial  value 40.597417 
## iter  10 value 19.009959
## iter  20 value 14.390042
## iter  30 value 13.844423
## iter  40 value 13.843837
## final  value 13.843836 
## converged
## # weights:  124
## initial  value 39.052462 
## iter  10 value 19.231930
## iter  20 value 12.458882
## iter  30 value 11.315982
## iter  40 value 11.220401
## iter  50 value 11.208265
## iter  60 value 11.207881
## final  value 11.207879 
## converged
## # weights:  206
## initial  value 38.717452 
## iter  10 value 17.154775
## iter  20 value 7.114946
## iter  30 value 3.931071
## iter  40 value 3.561011
## iter  50 value 3.530411
## iter  60 value 3.508046
## iter  70 value 3.035399
## iter  80 value 2.427074
## iter  90 value 1.397239
## iter 100 value 1.390372
## final  value 1.390372 
## stopped after 100 iterations
## # weights:  42
## initial  value 40.195727 
## iter  10 value 25.333235
## iter  20 value 15.077790
## iter  30 value 14.172502
## iter  40 value 12.110285
## iter  50 value 11.963469
## final  value 11.962828 
## converged
## # weights:  124
## initial  value 38.884663 
## iter  10 value 18.872263
## iter  20 value 11.484779
## iter  30 value 10.516289
## iter  40 value 10.038859
## iter  50 value 9.832094
## iter  60 value 9.745810
## iter  70 value 9.745120
## iter  70 value 9.745120
## iter  70 value 9.745120
## final  value 9.745120 
## converged
## # weights:  206
## initial  value 50.461538 
## iter  10 value 25.068375
## iter  20 value 16.051882
## iter  30 value 11.956054
## iter  40 value 11.512066
## iter  50 value 10.468311
## iter  60 value 9.842082
## iter  70 value 9.629081
## iter  80 value 9.497557
## iter  90 value 9.260460
## iter 100 value 9.202350
## final  value 9.202350 
## stopped after 100 iterations
## # weights:  42
## initial  value 37.451965 
## iter  10 value 29.589302
## iter  20 value 28.439442
## iter  30 value 28.427382
## iter  40 value 28.253599
## iter  50 value 24.347969
## iter  60 value 20.469627
## iter  70 value 18.572164
## iter  80 value 17.333765
## iter  90 value 14.714238
## iter 100 value 13.738015
## final  value 13.738015 
## stopped after 100 iterations
## # weights:  124
## initial  value 42.709113 
## iter  10 value 11.639490
## iter  20 value 2.520854
## iter  30 value 0.267160
## iter  40 value 0.243169
## iter  50 value 0.206714
## iter  60 value 0.168213
## iter  70 value 0.133088
## iter  80 value 0.122380
## iter  90 value 0.112877
## iter 100 value 0.108669
## final  value 0.108669 
## stopped after 100 iterations
## # weights:  206
## initial  value 37.506941 
## iter  10 value 18.604849
## iter  20 value 13.069905
## iter  30 value 5.643703
## iter  40 value 0.263280
## iter  50 value 0.173518
## iter  60 value 0.149607
## iter  70 value 0.139005
## iter  80 value 0.128954
## iter  90 value 0.118010
## iter 100 value 0.109979
## final  value 0.109979 
## stopped after 100 iterations
## # weights:  42
## initial  value 36.889615 
## iter  10 value 36.727384
## final  value 36.727366 
## converged
## # weights:  124
## initial  value 51.278943 
## iter  10 value 34.993438
## iter  20 value 15.486490
## iter  30 value 8.743606
## iter  40 value 6.289637
## iter  50 value 3.910378
## iter  60 value 3.821632
## iter  70 value 3.819593
## iter  80 value 3.819228
## iter  90 value 3.819126
## iter 100 value 3.819115
## final  value 3.819115 
## stopped after 100 iterations
## # weights:  206
## initial  value 39.581972 
## iter  10 value 14.017652
## iter  20 value 9.446853
## iter  30 value 9.419187
## iter  40 value 9.418467
## final  value 9.418467 
## converged
## # weights:  42
## initial  value 37.977920 
## iter  10 value 34.764588
## iter  20 value 19.209881
## iter  30 value 11.393677
## iter  40 value 10.575455
## [verbose nnet optimization traces omitted: repeated fits of networks with 42, 124, and 206
## weights (i.e., different hidden-layer sizes tried during tuning), each printing an "initial value",
## per-iteration "iter ... value ..." lines, and ending with either "converged" or
## "stopped after 100 iterations"]
## iter  10 value 31.533984
## iter  20 value 23.244036
## iter  30 value 15.611695
## iter  40 value 13.993738
## iter  50 value 13.216017
## iter  60 value 12.999943
## iter  70 value 12.882509
## iter  80 value 12.844642
## iter  90 value 12.776793
## iter 100 value 12.766262
## final  value 12.766262 
## stopped after 100 iterations
tictoc::toc()
## 4.58 sec elapsed
# Train neural net over the larger tuning grid
# (note: verbose = FALSE controls caret's own logging only; nnet's
#  per-iteration trace still prints unless suppressed separately)
tictoc::tic()
set.seed(42)
nn_model_voters_big_grid <- caret::train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", 
                                         trControl = fitControl, verbose = FALSE, tuneGrid = big_grid
                                         )
## # weights:  42
## initial  value 37.543902 
## iter  10 value 31.027084
## [... per-iteration nnet output truncated: one block per candidate model in
##  big_grid across the resamples, each reporting its "# weights", loss
##  trajectory, and "final value", ending in either "converged" or
##  "stopped after 100 iterations" ...]
## final  value 29.844139 
## converged
## # weights:  124
## initial  value 58.342469 
## iter  10 value 35.906190
## iter  20 value 30.940039
## iter  30 value 29.624763
## iter  40 value 29.297640
## iter  50 value 29.294130
## iter  60 value 29.293907
## final  value 29.293897 
## converged
## # weights:  165
## initial  value 65.922376 
## iter  10 value 34.063292
## iter  20 value 29.995251
## iter  30 value 29.270770
## iter  40 value 28.775425
## iter  50 value 28.744018
## iter  60 value 28.743467
## iter  70 value 28.743405
## final  value 28.743405 
## converged
## # weights:  206
## initial  value 70.258019 
## iter  10 value 33.613285
## iter  20 value 29.340304
## iter  30 value 28.755190
## iter  40 value 28.709487
## iter  50 value 28.558564
## iter  60 value 28.549334
## iter  70 value 28.549282
## final  value 28.549281 
## converged
## # weights:  42
## initial  value 38.873502 
## iter  10 value 27.463174
## iter  20 value 25.496490
## iter  30 value 22.359805
## iter  40 value 22.043627
## iter  50 value 20.364376
## iter  60 value 17.122479
## iter  70 value 16.602842
## iter  80 value 16.592783
## iter  90 value 16.588267
## iter 100 value 16.586596
## final  value 16.586596 
## stopped after 100 iterations
## # weights:  83
## initial  value 38.028342 
## iter  10 value 21.272785
## iter  20 value 16.612603
## iter  30 value 14.899140
## iter  40 value 13.850710
## iter  50 value 13.847035
## iter  60 value 13.845480
## iter  70 value 13.844908
## iter  80 value 13.844190
## iter  90 value 13.844088
## iter 100 value 13.843964
## final  value 13.843964 
## stopped after 100 iterations
## # weights:  124
## initial  value 41.427311 
## iter  10 value 17.505370
## iter  20 value 11.810086
## iter  30 value 9.838674
## iter  40 value 9.822190
## iter  50 value 9.820558
## iter  60 value 9.819934
## final  value 9.819908 
## converged
## # weights:  165
## initial  value 38.377271 
## iter  10 value 25.540316
## iter  20 value 21.535290
## iter  30 value 9.879150
## iter  40 value 4.443147
## iter  50 value 3.864126
## iter  60 value 3.826843
## iter  70 value 3.576864
## iter  80 value 1.951754
## iter  90 value 1.921169
## iter 100 value 1.649334
## final  value 1.649334 
## stopped after 100 iterations
## # weights:  206
## initial  value 40.937530 
## iter  10 value 24.758108
## iter  20 value 16.530500
## iter  30 value 14.619314
## iter  40 value 13.432714
## iter  50 value 13.412821
## iter  60 value 13.406982
## iter  70 value 13.292743
## iter  80 value 13.287146
## iter  90 value 13.287051
## iter 100 value 13.286180
## final  value 13.286180 
## stopped after 100 iterations
## # weights:  42
## initial  value 46.893364 
## iter  10 value 37.235405
## iter  20 value 33.772756
## iter  30 value 33.593184
## final  value 33.593076 
## converged
## # weights:  83
## initial  value 50.076399 
## iter  10 value 35.004745
## iter  20 value 32.027528
## iter  30 value 31.857588
## iter  40 value 31.673598
## iter  50 value 31.670622
## final  value 31.670603 
## converged
## # weights:  124
## initial  value 64.144959 
## iter  10 value 35.992016
## iter  20 value 31.733684
## iter  30 value 31.245672
## iter  40 value 31.229032
## iter  50 value 31.226533
## iter  60 value 31.226044
## iter  60 value 31.226044
## iter  60 value 31.226044
## final  value 31.226044 
## converged
## # weights:  165
## initial  value 67.277161 
## iter  10 value 34.681115
## iter  20 value 31.738505
## iter  30 value 31.188559
## iter  40 value 31.122858
## iter  50 value 31.091790
## iter  60 value 31.091600
## final  value 31.091502 
## converged
## # weights:  206
## initial  value 71.726023 
## iter  10 value 36.297715
## iter  20 value 32.621907
## iter  30 value 30.881638
## iter  40 value 30.595635
## iter  50 value 30.557812
## iter  60 value 30.551557
## iter  70 value 30.551150
## final  value 30.551128 
## converged
## # weights:  42
## initial  value 38.761438 
## final  value 36.727366 
## converged
## # weights:  83
## initial  value 37.104068 
## iter  10 value 22.594903
## iter  20 value 11.389778
## iter  30 value 11.211085
## iter  40 value 11.207880
## final  value 11.207849 
## converged
## # weights:  124
## initial  value 37.679868 
## iter  10 value 22.544810
## iter  20 value 7.703849
## iter  30 value 1.568541
## iter  40 value 0.039338
## iter  50 value 0.000391
## final  value 0.000062 
## converged
## # weights:  165
## initial  value 37.314057 
## iter  10 value 14.393939
## iter  20 value 11.075259
## iter  30 value 8.577798
## iter  40 value 8.089404
## iter  50 value 8.066762
## iter  60 value 8.058540
## iter  70 value 7.609696
## iter  80 value 7.604990
## iter  90 value 7.604267
## iter 100 value 7.604159
## final  value 7.604159 
## stopped after 100 iterations
## # weights:  206
## initial  value 38.366015 
## iter  10 value 13.549450
## iter  20 value 4.258180
## iter  30 value 2.055016
## iter  40 value 1.929145
## iter  50 value 1.912337
## iter  60 value 1.911542
## iter  70 value 1.909752
## iter  80 value 0.269463
## iter  90 value 0.004006
## iter 100 value 0.000392
## final  value 0.000392 
## stopped after 100 iterations
## # weights:  42
## initial  value 42.532564 
## iter  10 value 31.737216
## iter  20 value 31.510626
## final  value 31.510570 
## converged
## # weights:  83
## initial  value 50.030179 
## iter  10 value 33.072089
## iter  20 value 28.982620
## iter  30 value 28.874714
## iter  40 value 28.873172
## final  value 28.873171 
## converged
## # weights:  124
## initial  value 58.128842 
## iter  10 value 32.990424
## iter  20 value 29.419985
## iter  30 value 28.852844
## iter  40 value 28.360847
## iter  50 value 28.276144
## iter  60 value 28.275304
## final  value 28.275300 
## converged
## # weights:  165
## initial  value 65.179239 
## iter  10 value 36.475489
## iter  20 value 30.015834
## iter  30 value 28.319212
## iter  40 value 27.766275
## iter  50 value 27.673135
## iter  60 value 27.672377
## iter  60 value 27.672377
## iter  60 value 27.672377
## final  value 27.672377 
## converged
## # weights:  206
## initial  value 66.436741 
## iter  10 value 37.164963
## iter  20 value 29.220663
## iter  30 value 27.831162
## iter  40 value 27.657073
## iter  50 value 27.456328
## iter  60 value 27.444083
## iter  70 value 27.441190
## iter  80 value 27.440661
## final  value 27.440652 
## converged
## # weights:  42
## initial  value 38.020564 
## iter  10 value 21.169013
## iter  20 value 11.902843
## iter  30 value 9.660604
## iter  40 value 9.649831
## iter  50 value 9.647173
## iter  60 value 9.645869
## iter  70 value 9.645364
## final  value 9.645304 
## converged
## # weights:  83
## initial  value 40.865443 
## iter  10 value 24.451119
## iter  20 value 7.701496
## iter  30 value 7.564427
## iter  40 value 7.563745
## iter  50 value 7.563664
## final  value 7.563662 
## converged
## # weights:  124
## initial  value 39.793548 
## iter  10 value 24.008245
## iter  20 value 11.701516
## iter  30 value 11.421202
## iter  40 value 11.407686
## iter  50 value 11.403961
## iter  60 value 11.403738
## iter  70 value 11.403705
## iter  80 value 11.403527
## final  value 11.403526 
## converged
## # weights:  165
## initial  value 38.076318 
## iter  10 value 22.377603
## iter  20 value 3.953223
## iter  30 value 0.039332
## iter  40 value 0.001438
## final  value 0.000049 
## converged
## # weights:  206
## initial  value 41.456225 
## iter  10 value 17.738450
## iter  20 value 4.442076
## iter  30 value 3.953717
## iter  40 value 3.764614
## iter  50 value 3.640465
## iter  60 value 3.633976
## iter  70 value 1.456619
## iter  80 value 1.407404
## iter  90 value 1.390347
## iter 100 value 1.386809
## final  value 1.386809 
## stopped after 100 iterations
## # weights:  42
## initial  value 42.903201 
## iter  10 value 36.084512
## iter  20 value 31.776249
## iter  30 value 31.728425
## final  value 31.728253 
## converged
## # weights:  83
## initial  value 49.933407 
## iter  10 value 34.858093
## iter  20 value 31.595276
## iter  30 value 29.684070
## iter  40 value 29.442436
## iter  50 value 29.441678
## final  value 29.441674 
## converged
## # weights:  124
## initial  value 59.243299 
## iter  10 value 33.343694
## iter  20 value 29.292536
## iter  30 value 28.969588
## iter  40 value 28.892957
## iter  50 value 28.890646
## iter  60 value 28.890512
## final  value 28.890511 
## converged
## # weights:  165
## initial  value 65.436414 
## iter  10 value 33.968407
## iter  20 value 29.725012
## iter  30 value 28.718684
## iter  40 value 28.366731
## iter  50 value 28.348157
## iter  60 value 28.347200
## final  value 28.347199 
## converged
## # weights:  206
## initial  value 74.187729 
## iter  10 value 33.723062
## iter  20 value 28.622307
## iter  30 value 28.335937
## iter  40 value 28.162297
## iter  50 value 28.154444
## iter  60 value 28.153724
## final  value 28.153721 
## converged
## # weights:  83
## initial  value 87.504295 
## iter  10 value 55.605142
## iter  20 value 50.715555
## iter  30 value 42.439019
## iter  40 value 42.146910
## iter  50 value 41.654480
## iter  60 value 41.424125
## iter  70 value 41.423215
## final  value 41.423095 
## converged
tictoc::toc()
## 4.85 sec elapsed
# Train control with random search
fitControl <- caret::trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random")

# Test 6 random hyperparameter combinations (tuneLength = 6 draws 6 random
# size/decay pairs; note that the nnet iteration traces below are silenced
# by trace = FALSE, not by verbose = FALSE)
tictoc::tic()
nn_model_voters_big_grid <- caret::train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", 
                                         trControl = fitControl, verbose = FALSE, tuneLength = 6
                                         )
## # weights:  780
## initial  value 42.095991 
## iter  10 value 11.477256
## iter  20 value 0.192048
## iter  30 value 0.139685
## iter  40 value 0.115191
## iter  50 value 0.094285
## iter  60 value 0.078501
## iter  70 value 0.072942
## iter  80 value 0.068058
## iter  90 value 0.065069
## iter 100 value 0.060509
## final  value 0.060509 
## stopped after 100 iterations
## # weights:  42
## initial  value 59.042295 
## iter  10 value 37.432959
## iter  20 value 37.420418
## final  value 37.420383 
## converged
## # weights:  452
## initial  value 553.839156 
## iter  10 value 56.555697
## iter  20 value 37.948157
## iter  30 value 37.459174
## iter  40 value 37.433606
## iter  50 value 37.430055
## iter  60 value 37.429948
## iter  60 value 37.429948
## iter  60 value 37.429948
## final  value 37.429948 
## converged
## # weights:  83
## initial  value 78.146350 
## iter  10 value 37.402337
## iter  20 value 37.302006
## iter  30 value 37.301722
## final  value 37.301721 
## converged
## # weights:  165
## initial  value 37.604154 
## iter  10 value 15.979035
## iter  20 value 7.272046
## iter  30 value 3.830592
## iter  40 value 1.933663
## iter  50 value 0.387161
## iter  60 value 0.341547
## iter  70 value 0.284280
## iter  80 value 0.249284
## iter  90 value 0.200176
## iter 100 value 0.163334
## final  value 0.163334 
## stopped after 100 iterations
## # weights:  616
## initial  value 44.262407 
## iter  10 value 18.091032
## iter  20 value 7.992399
## iter  30 value 3.862796
## iter  40 value 2.793425
## iter  50 value 2.491309
## iter  60 value 2.376890
## iter  70 value 2.306483
## iter  80 value 2.279268
## iter  90 value 2.238432
## iter 100 value 2.216990
## final  value 2.216990 
## stopped after 100 iterations
## # weights:  780
## initial  value 37.722928 
## iter  10 value 15.819151
## iter  20 value 1.354592
## iter  30 value 0.210547
## iter  40 value 0.144478
## iter  50 value 0.134844
## iter  60 value 0.122631
## iter  70 value 0.101942
## iter  80 value 0.091843
## iter  90 value 0.081159
## iter 100 value 0.072211
## final  value 0.072211 
## stopped after 100 iterations
## # weights:  42
## initial  value 61.134041 
## iter  10 value 35.945541
## iter  20 value 35.812616
## final  value 35.812590 
## converged
## # weights:  452
## initial  value 585.216109 
## iter  10 value 38.436536
## iter  20 value 36.137038
## iter  30 value 36.047457
## iter  40 value 36.043713
## final  value 36.043653 
## converged
## # weights:  83
## initial  value 79.438600 
## iter  10 value 35.704312
## iter  20 value 35.354481
## iter  30 value 35.352628
## final  value 35.352627 
## converged
## # weights:  165
## initial  value 36.013564 
## iter  10 value 17.015511
## iter  20 value 5.273321
## iter  30 value 2.992541
## iter  40 value 2.951417
## iter  50 value 2.819570
## iter  60 value 0.724280
## iter  70 value 0.270301
## iter  80 value 0.231933
## iter  90 value 0.203429
## iter 100 value 0.163522
## final  value 0.163522 
## stopped after 100 iterations
## # weights:  616
## initial  value 47.992166 
## iter  10 value 16.954806
## iter  20 value 6.644326
## iter  30 value 3.594294
## iter  40 value 2.691986
## iter  50 value 2.467660
## iter  60 value 2.333128
## iter  70 value 2.257916
## iter  80 value 2.215114
## iter  90 value 2.193699
## iter 100 value 2.185256
## final  value 2.185256 
## stopped after 100 iterations
## # weights:  780
## initial  value 39.163505 
## iter  10 value 10.052538
## iter  20 value 0.224281
## iter  30 value 0.134141
## iter  40 value 0.103848
## iter  50 value 0.081907
## iter  60 value 0.070483
## iter  70 value 0.062302
## iter  80 value 0.056113
## iter  90 value 0.050826
## iter 100 value 0.047241
## final  value 0.047241 
## stopped after 100 iterations
## # weights:  42
## initial  value 63.990222 
## iter  10 value 37.426187
## iter  20 value 37.303877
## final  value 37.303437 
## converged
## # weights:  452
## initial  value 575.470315 
## iter  10 value 43.960207
## iter  20 value 37.619610
## iter  30 value 37.433908
## iter  40 value 37.430696
## iter  50 value 37.429952
## final  value 37.429948 
## converged
## # weights:  83
## initial  value 83.195204 
## iter  10 value 37.901089
## iter  20 value 37.044286
## iter  30 value 36.985201
## final  value 36.985143 
## converged
## # weights:  165
## initial  value 39.061368 
## iter  10 value 22.181704
## iter  20 value 10.718664
## iter  30 value 5.223358
## iter  40 value 2.843380
## iter  50 value 2.548575
## iter  60 value 2.513813
## iter  70 value 2.468511
## iter  80 value 2.162969
## iter  90 value 1.713318
## iter 100 value 0.476445
## final  value 0.476445 
## stopped after 100 iterations
## # weights:  616
## initial  value 39.310513 
## iter  10 value 21.235338
## iter  20 value 7.409032
## iter  30 value 3.707935
## iter  40 value 2.933656
## iter  50 value 2.589250
## iter  60 value 2.434121
## iter  70 value 2.307593
## iter  80 value 2.245299
## iter  90 value 2.215206
## iter 100 value 2.195757
## final  value 2.195757 
## stopped after 100 iterations
## # weights:  780
## initial  value 45.390625 
## iter  10 value 1.260303
## iter  20 value 0.111025
## iter  30 value 0.096680
## iter  40 value 0.085225
## iter  50 value 0.067893
## iter  60 value 0.056329
## iter  70 value 0.050509
## iter  80 value 0.045782
## iter  90 value 0.042122
## iter 100 value 0.039743
## final  value 0.039743 
## stopped after 100 iterations
## # weights:  42
## initial  value 63.371226 
## iter  10 value 36.791974
## iter  20 value 36.691666
## iter  30 value 36.641830
## final  value 36.641786 
## converged
## # weights:  452
## initial  value 602.935601 
## iter  10 value 39.338365
## iter  20 value 36.775330
## iter  30 value 36.729462
## iter  40 value 36.729128
## final  value 36.729122 
## converged
## # weights:  83
## initial  value 102.375607 
## iter  10 value 36.798850
## iter  20 value 36.429710
## iter  30 value 36.427117
## final  value 36.427116 
## converged
## # weights:  165
## initial  value 36.625031 
## iter  10 value 14.846930
## iter  20 value 1.990558
## iter  30 value 0.333866
## iter  40 value 0.294679
## iter  50 value 0.262435
## iter  60 value 0.231028
## iter  70 value 0.207296
## iter  80 value 0.186618
## iter  90 value 0.161398
## iter 100 value 0.124395
## final  value 0.124395 
## stopped after 100 iterations
## # weights:  616
## initial  value 45.637851 
## iter  10 value 12.687417
## iter  20 value 4.155737
## iter  30 value 2.378098
## iter  40 value 2.013474
## iter  50 value 1.878404
## iter  60 value 1.812370
## iter  70 value 1.780197
## iter  80 value 1.770715
## iter  90 value 1.764248
## iter 100 value 1.758322
## final  value 1.758322 
## stopped after 100 iterations
## # weights:  780
## initial  value 42.819396 
## iter  10 value 12.987554
## iter  20 value 0.417699
## iter  30 value 0.268359
## iter  40 value 0.163308
## iter  50 value 0.121675
## iter  60 value 0.102754
## iter  70 value 0.094386
## iter  80 value 0.089171
## iter  90 value 0.081589
## iter 100 value 0.077903
## final  value 0.077903 
## stopped after 100 iterations
## # weights:  42
## initial  value 59.457067 
## iter  10 value 36.719317
## iter  20 value 36.512862
## final  value 36.512485 
## converged
## # weights:  452
## initial  value 641.178198 
## iter  10 value 38.317632
## iter  20 value 36.843606
## iter  30 value 36.735746
## iter  40 value 36.729089
## iter  50 value 36.728877
## final  value 36.728877 
## converged
## # weights:  83
## initial  value 74.352054 
## iter  10 value 36.462081
## iter  20 value 36.160669
## final  value 36.159809 
## converged
## # weights:  165
## initial  value 38.662104 
## iter  10 value 12.629428
## iter  20 value 6.705331
## iter  30 value 5.860436
## iter  40 value 4.651619
## iter  50 value 4.537701
## iter  60 value 2.872848
## iter  70 value 0.706095
## iter  80 value 0.453485
## iter  90 value 0.364220
## iter 100 value 0.314568
## final  value 0.314568 
## stopped after 100 iterations
## # weights:  616
## initial  value 40.009335 
## iter  10 value 14.087456
## iter  20 value 5.617005
## iter  30 value 3.152658
## iter  40 value 2.513647
## iter  50 value 2.353026
## iter  60 value 2.306978
## iter  70 value 2.266536
## iter  80 value 2.239816
## iter  90 value 2.228624
## iter 100 value 2.218696
## final  value 2.218696 
## stopped after 100 iterations
## # weights:  780
## initial  value 39.005315 
## iter  10 value 14.769652
## iter  20 value 1.739485
## iter  30 value 0.238076
## iter  40 value 0.186497
## iter  50 value 0.150203
## iter  60 value 0.113830
## iter  70 value 0.096412
## iter  80 value 0.083231
## iter  90 value 0.074761
## iter 100 value 0.069245
## final  value 0.069245 
## stopped after 100 iterations
## # weights:  42
## initial  value 55.825909 
## iter  10 value 37.452869
## iter  20 value 37.332664
## iter  30 value 37.328823
## final  value 37.328821 
## converged
## # weights:  452
## initial  value 573.254838 
## iter  10 value 38.786598
## iter  20 value 37.459129
## iter  30 value 37.431030
## iter  40 value 37.429958
## final  value 37.429948 
## converged
## # weights:  83
## initial  value 79.483393 
## iter  10 value 37.477852
## iter  20 value 37.050537
## iter  30 value 37.049302
## final  value 37.049265 
## converged
## # weights:  165
## initial  value 48.578017 
## iter  10 value 29.702696
## iter  20 value 16.419372
## iter  30 value 14.992210
## iter  40 value 14.075778
## iter  50 value 13.281265
## iter  60 value 11.341777
## iter  70 value 10.877987
## iter  80 value 10.792720
## iter  90 value 10.374044
## iter 100 value 9.186177
## final  value 9.186177 
## stopped after 100 iterations
## # weights:  616
## initial  value 62.705898 
## iter  10 value 15.210731
## iter  20 value 6.657819
## iter  30 value 3.558933
## iter  40 value 2.588783
## iter  50 value 2.412344
## iter  60 value 2.339109
## iter  70 value 2.252255
## iter  80 value 2.203906
## iter  90 value 2.168884
## iter 100 value 2.149394
## final  value 2.149394 
## stopped after 100 iterations
## # weights:  780
## initial  value 40.292120 
## iter  10 value 14.246517
## iter  20 value 2.989623
## iter  30 value 0.240685
## iter  40 value 0.200952
## iter  50 value 0.134743
## iter  60 value 0.098611
## iter  70 value 0.078535
## iter  80 value 0.070926
## iter  90 value 0.062139
## iter 100 value 0.055763
## final  value 0.055763 
## stopped after 100 iterations
## # weights:  42
## initial  value 59.897852 
## iter  10 value 36.814750
## iter  20 value 36.681968
## iter  30 value 36.422000
## iter  40 value 36.421020
## iter  40 value 36.421020
## iter  40 value 36.421020
## final  value 36.421020 
## converged
## # weights:  452
## initial  value 610.498144 
## iter  10 value 48.291560
## iter  20 value 37.874822
## iter  30 value 36.769871
## iter  40 value 36.732825
## iter  50 value 36.729117
## iter  60 value 36.728756
## iter  60 value 36.728756
## iter  60 value 36.728756
## final  value 36.728756 
## converged
## # weights:  83
## initial  value 79.112134 
## iter  10 value 37.170455
## iter  20 value 35.844240
## iter  30 value 35.771595
## final  value 35.771550 
## converged
## # weights:  165
## initial  value 48.479803 
## iter  10 value 26.887564
## iter  20 value 14.959194
## iter  30 value 11.956342
## iter  40 value 11.930216
## iter  50 value 11.897640
## iter  60 value 11.879147
## iter  70 value 11.739201
## iter  80 value 9.742511
## iter  90 value 9.725574
## iter 100 value 9.639212
## final  value 9.639212 
## stopped after 100 iterations
## # weights:  616
## initial  value 39.624363 
## iter  10 value 17.050045
## iter  20 value 4.846687
## iter  30 value 2.799602
## iter  40 value 2.309892
## iter  50 value 2.145687
## iter  60 value 2.062018
## iter  70 value 2.014823
## iter  80 value 1.989463
## iter  90 value 1.977966
## iter 100 value 1.972199
## final  value 1.972199 
## stopped after 100 iterations
## # weights:  780
## initial  value 40.977390 
## iter  10 value 9.840887
## iter  20 value 1.011521
## iter  30 value 0.246611
## iter  40 value 0.188889
## iter  50 value 0.138214
## iter  60 value 0.118413
## iter  70 value 0.099929
## iter  80 value 0.082123
## iter  90 value 0.072491
## iter 100 value 0.066546
## final  value 0.066546 
## stopped after 100 iterations
## # weights:  42
## initial  value 58.763565 
## iter  10 value 36.715493
## iter  20 value 36.709613
## final  value 36.709611 
## converged
## # weights:  452
## initial  value 670.925860 
## iter  10 value 64.326356
## iter  20 value 37.073699
## iter  30 value 36.738858
## iter  40 value 36.730563
## iter  50 value 36.729230
## final  value 36.729224 
## converged
## # weights:  83
## initial  value 76.191688 
## iter  10 value 36.708437
## iter  20 value 36.677110
## iter  30 value 36.635095
## final  value 36.634998 
## converged
## # weights:  165
## initial  value 38.448315 
## iter  10 value 36.746551
## iter  20 value 36.525825
## iter  30 value 16.480157
## iter  40 value 14.116140
## iter  50 value 12.029902
## iter  60 value 11.884332
## iter  70 value 11.874378
## iter  80 value 11.866986
## iter  90 value 11.629927
## iter 100 value 2.017779
## final  value 2.017779 
## stopped after 100 iterations
## # weights:  616
## initial  value 40.134269 
## iter  10 value 13.293328
## iter  20 value 4.957187
## iter  30 value 3.095058
## iter  40 value 2.538831
## iter  50 value 2.379299
## iter  60 value 2.303894
## iter  70 value 2.259849
## iter  80 value 2.232007
## iter  90 value 2.218373
## iter 100 value 2.206641
## final  value 2.206641 
## stopped after 100 iterations
## # weights:  780
## initial  value 50.393083 
## iter  10 value 10.808405
## iter  20 value 3.142951
## iter  30 value 0.978622
## iter  40 value 0.205693
## iter  50 value 0.187392
## iter  60 value 0.165499
## iter  70 value 0.136085
## iter  80 value 0.120482
## iter  90 value 0.102364
## iter 100 value 0.090728
## final  value 0.090728 
## stopped after 100 iterations
## # weights:  42
## initial  value 56.129598 
## iter  10 value 37.429995
## iter  20 value 37.341543
## final  value 37.340761 
## converged
## # weights:  452
## initial  value 580.494720 
## iter  10 value 41.243636
## iter  20 value 37.529179
## iter  30 value 37.435238
## iter  40 value 37.430108
## final  value 37.429948 
## converged
## # weights:  83
## initial  value 76.715138 
## iter  10 value 37.414601
## iter  20 value 37.209138
## iter  30 value 37.093629
## final  value 37.093595 
## converged
## # weights:  165
## initial  value 36.739867 
## iter  10 value 18.895476
## iter  20 value 2.836283
## iter  30 value 2.433627
## iter  40 value 0.599902
## iter  50 value 0.412432
## iter  60 value 0.373373
## iter  70 value 0.310848
## iter  80 value 0.236368
## iter  90 value 0.207126
## iter 100 value 0.194429
## final  value 0.194429 
## stopped after 100 iterations
## # weights:  616
## initial  value 39.182652 
## iter  10 value 13.391480
## iter  20 value 5.361618
## iter  30 value 2.914686
## iter  40 value 2.591341
## iter  50 value 2.460647
## iter  60 value 2.374053
## iter  70 value 2.332764
## iter  80 value 2.295253
## iter  90 value 2.278175
## iter 100 value 2.271961
## final  value 2.271961 
## stopped after 100 iterations
## # weights:  780
## initial  value 37.461655 
## iter  10 value 16.307209
## iter  20 value 0.470690
## iter  30 value 0.147915
## iter  40 value 0.121580
## iter  50 value 0.099273
## iter  60 value 0.086040
## iter  70 value 0.078263
## iter  80 value 0.071824
## iter  90 value 0.066887
## iter 100 value 0.061589
## final  value 0.061589 
## stopped after 100 iterations
## # weights:  42
## initial  value 59.536250 
## iter  10 value 36.766669
## iter  20 value 36.731188
## iter  30 value 36.716264
## final  value 36.716235 
## converged
## # weights:  452
## initial  value 603.217414 
## iter  10 value 50.097127
## iter  20 value 37.369234
## iter  30 value 36.741135
## iter  40 value 36.730330
## iter  50 value 36.729249
## final  value 36.729222 
## converged
## # weights:  83
## initial  value 78.919273 
## iter  10 value 36.785511
## iter  20 value 36.714335
## iter  30 value 36.674450
## iter  40 value 36.665749
## final  value 36.665741 
## converged
## # weights:  165
## initial  value 45.265427 
## iter  10 value 27.334581
## iter  20 value 17.481668
## iter  30 value 15.592648
## iter  40 value 11.155156
## iter  50 value 4.904583
## iter  60 value 3.921931
## iter  70 value 3.116526
## iter  80 value 3.028310
## iter  90 value 3.013188
## iter 100 value 2.972920
## final  value 2.972920 
## stopped after 100 iterations
## # weights:  616
## initial  value 58.270668 
## iter  10 value 22.796252
## iter  20 value 9.279852
## iter  30 value 5.206134
## iter  40 value 3.811898
## iter  50 value 3.244092
## iter  60 value 3.074384
## iter  70 value 3.004659
## iter  80 value 2.960947
## iter  90 value 2.922234
## iter 100 value 2.897032
## final  value 2.897032 
## stopped after 100 iterations
## # weights:  780
## initial  value 41.376552 
## iter  10 value 14.468082
## iter  20 value 0.351423
## iter  30 value 0.137733
## iter  40 value 0.112835
## iter  50 value 0.089230
## iter  60 value 0.075916
## iter  70 value 0.068009
## iter  80 value 0.060238
## iter  90 value 0.054936
## iter 100 value 0.049549
## final  value 0.049549 
## stopped after 100 iterations
## # weights:  42
## initial  value 64.261380 
## iter  10 value 36.415336
## iter  20 value 36.350419
## final  value 36.350404 
## converged
## # weights:  452
## initial  value 575.285191 
## iter  10 value 45.325764
## iter  20 value 38.035560
## iter  30 value 36.774493
## iter  40 value 36.733352
## iter  50 value 36.728795
## final  value 36.728681 
## converged
## # weights:  83
## initial  value 75.229765 
## iter  10 value 36.378772
## iter  20 value 35.660524
## final  value 35.659431 
## converged
## # weights:  165
## initial  value 42.292548 
## iter  10 value 20.652713
## iter  20 value 7.562533
## iter  30 value 4.414666
## iter  40 value 4.389132
## iter  50 value 4.380923
## iter  60 value 4.363783
## iter  70 value 4.354804
## iter  80 value 4.349484
## iter  90 value 4.347580
## iter 100 value 4.346070
## final  value 4.346070 
## stopped after 100 iterations
## # weights:  616
## initial  value 51.161062 
## iter  10 value 13.850106
## iter  20 value 5.969366
## iter  30 value 3.075719
## iter  40 value 2.350861
## iter  50 value 2.082022
## iter  60 value 1.934655
## iter  70 value 1.884197
## iter  80 value 1.865850
## iter  90 value 1.854994
## iter 100 value 1.838942
## final  value 1.838942 
## stopped after 100 iterations
## ...
tictoc::toc()
## 10.8 sec elapsed
# Define trainControl: adaptive cross-validation
fitControl <- caret::trainControl(method = "adaptive_cv", number = 3, repeats = 3)

# Define trainControl: adaptive cross-validation with random search
fitControl <- caret::trainControl(method = "adaptive_cv", number = 3, repeats = 3, search = "random")

# Define trainControl: adaptive cross-validation with explicit adaptive settings
# (Bradley-Terry resampling model, incomplete runs allowed so poor candidates are dropped early)
fitControl <- caret::trainControl(method = "adaptive_cv", number = 3, repeats = 3,
                                  adaptive = list(min = 3, alpha = 0.05, method = "BT", complete = FALSE),
                                  search = "random"
                                  )

# Start timer & train model
# (note: verbose = FALSE does not silence nnet itself; passing trace = FALSE
# through caret::train to nnet would suppress its iteration traces)
tictoc::tic()
svm_model_voters_ar <- caret::train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", 
                                    trControl = fitControl, verbose = FALSE, tuneLength = 6
                                    )
## 
## Attaching package: 'nnet'
## The following object is masked from 'package:mgcv':
## 
##     multinom
## # weights:  124
## initial  value 44.275967 
## iter  10 value 36.440463
## iter  20 value 23.082035
## iter  30 value 21.964016
## iter  40 value 21.761035
## iter  50 value 21.755784
## iter  60 value 21.753926
## iter  70 value 21.753573
## final  value 21.753573 
## converged
## ...
## Loading required namespace: BradleyTerry2
## Warning in eval(family$initialize): non-integer counts in a binomial glm!
## ...
tictoc::toc()
## 8.97 sec elapsed

Chapter 3 - Hyperparameter Tuning with mlr

Machine Learning with mlr:

  • The package mlr can be used as a framework for machine learning in R
      1. Define the task
      2. Define the learner
      3. Fit the model
  • There are many possible tasks in mlr
    • RegrTask() for regression
    • ClassifTask() for binary and multi-class classification
    • MultilabelTask() for multi-label classification problems
    • CostSensTask() for general cost-sensitive classification
  • Example of creating a classification task and fitting a model
    • task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
    • listLearners()
    • lrn <- makeLearner("classif.h2o.deeplearning", fix.factors.prediction = TRUE, predict.type = "prob")
    • model <- train(lrn, task)
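The three-step workflow above can be sketched end-to-end. This is a minimal sketch assuming the mlr and rpart packages are installed; classif.rpart and the built-in iris data stand in for classif.h2o.deeplearning and knowledge_train_data so the example runs without an h2o cluster:

```r
library(mlr)

# 1. Define the task (iris stands in for knowledge_train_data)
task <- makeClassifTask(data = iris, target = "Species")

# 2. Define the learner (classif.rpart stands in for classif.h2o.deeplearning)
lrn <- makeLearner("classif.rpart", fix.factors.prediction = TRUE, predict.type = "prob")

# 3. Fit the model and predict class probabilities on a few rows
model <- train(lrn, task)
pred  <- predict(model, newdata = iris[1:3, ])
```

Any learner from listLearners() can be dropped into step 2 without changing the rest of the workflow.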

Grid and Random Search with mlr:

  • There are three things to define for hyperparameter tuning using mlr
    • the search space for every hyperparameter
    • the tuning method (e.g. grid or random search)
    • the resampling method
  • Example of defining the search space for every hyperparameter
    • makeParamSet( makeNumericParam(), makeIntegerParam(), makeDiscreteParam(), makeLogicalParam(), makeDiscreteVectorParam() )
    • getParamSet("classif.h2o.deeplearning")
    • param_set <- makeParamSet( makeDiscreteParam("hidden", values = list(one = 10, two = c(10, 5, 10))), makeDiscreteParam("activation", values = c("Rectifier", "Tanh")), makeNumericParam("l1", lower = 0.0001, upper = 1), makeNumericParam("l2", lower = 0.0001, upper = 1) )
  • Example of tuning - can use either grid search (requires a discrete search space) or random search (handles all parameter types)
    • ctrl_grid <- makeTuneControlGrid()
    • ctrl_random <- makeTuneControlRandom()
  • Example of defining the resampling strategy
    • cross_val <- makeResampleDesc("RepCV", predict = "both", folds = 5 * 3)
    • param_set <- makeParamSet( makeDiscreteParam("mtry", values = c(2,3,4,5)) )
    • ctrl_grid <- makeTuneControlGrid()
    • task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
    • lrn <- makeLearner("classif.h2o.deeplearning", predict.type = "prob", fix.factors.prediction = TRUE)
    • lrn_tune <- tuneParams(lrn, task, resampling = cross_val, control = ctrl_grid, par.set = param_set)
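Putting the three ingredients together, a runnable grid-search sketch looks like this (assumptions: mlr and rpart installed; rpart's discrete maxdepth parameter replaces the h2o search space above, and iris replaces knowledge_train_data):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")
lrn  <- makeLearner("classif.rpart")

# 1. Search space: grid search requires discrete values
param_set <- makeParamSet(
  makeDiscreteParam("maxdepth", values = 2:5)
)

# 2. Tuning method
ctrl_grid <- makeTuneControlGrid()

# 3. Resampling method
cv3 <- makeResampleDesc("CV", iters = 3)

lrn_tune <- tuneParams(lrn, task, resampling = cv3,
                       control = ctrl_grid, par.set = param_set)
lrn_tune$x  # best hyperparameter value found
```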

Evaluating Hyperparameters with mlr:

  • Generally, the goal of hyperparameter tuning is to understand each hyperparameter's impact on performance, as well as convergence toward optimal parameter values
  • Recap of the basic process, this time using a holdout sample rather than repeated cross-validation
    • getParamSet("classif.h2o.deeplearning")
    • param_set <- makeParamSet( makeDiscreteParam("hidden", values = list(one = 10, two = c(10, 5, 10))), makeDiscreteParam("activation", values = c("Rectifier", "Tanh")), makeNumericParam("l1", lower = 0.0001, upper = 1), makeNumericParam("l2", lower = 0.0001, upper = 1) )
    • ctrl_random <- makeTuneControlRandom(maxit = 50)
    • holdout <- makeResampleDesc("Holdout")
    • task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
    • lrn <- makeLearner("classif.h2o.deeplearning", predict.type = "prob", fix.factors.prediction = TRUE)
    • lrn_tune <- tuneParams(lrn, task, resampling = holdout, control = ctrl_random, par.set = param_set)
  • Can visualize the effect of each tuned hyperparameter on model performance
    • generateHyperParsEffectData(lrn_tune, partial.dep = TRUE)
    • hyperpar_effects <- generateHyperParsEffectData(lrn_tune, partial.dep = TRUE)
    • plotHyperParsEffect(hyperpar_effects, partial.dep.learn = "regr.randomForest", x = "l1", y = "mmce.test.mean", z = "hidden", plot.type = "line")
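A self-contained sketch of inspecting tuning results follows (assumptions: mlr and rpart installed; rpart's numeric cp parameter replaces l1/l2, and partial.dep is not needed because only one hyperparameter is tuned):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")
lrn  <- makeLearner("classif.rpart")
param_set <- makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))

ctrl_random <- makeTuneControlRandom(maxit = 10)
holdout <- makeResampleDesc("Holdout")

lrn_tune <- tuneParams(lrn, task, resampling = holdout,
                       control = ctrl_random, par.set = param_set)

# One row per tuning iteration, with cp and mmce.test.mean columns
hyperpar_effects <- generateHyperParsEffectData(lrn_tune)
plotHyperParsEffect(hyperpar_effects, x = "cp", y = "mmce.test.mean",
                    plot.type = "line")
```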

Advanced Tuning with mlr:

  • There are several advanced tuning capabilities available in mlr
    • makeTuneControlCMAES: CMA Evolution Strategy
    • makeTuneControlDesign: Predefined data frame of hyperparameters
    • makeTuneControlGenSA: Generalized simulated annealing
    • makeTuneControlIrace: Tuning with iterated F-Racing
    • makeTuneControlMBO: Model-based / Bayesian optimization
  • Can choose the evaluation metrics within mlr
    • ctrl_gensa <- makeTuneControlGenSA()
    • bootstrap <- makeResampleDesc("Bootstrap", predict = "both")
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = bootstrap, control = ctrl_gensa, par.set = param_set, measures = list(acc, mmce))
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = bootstrap, control = ctrl_gensa, par.set = param_set, measures = list(acc, setAggregation(acc, train.mean), mmce, setAggregation(mmce, train.mean)))
  • Can also run using nested cross-validation
    • lrn_wrapper <- makeTuneWrapper(learner = lrn, resampling = bootstrap, control = ctrl_gensa, par.set = param_set, measures = list(acc, mmce))
    • model_nested <- train(lrn_wrapper, task)
    • getTuneResult(model_nested)
    • cv2 <- makeResampleDesc("CV", iters = 2)
    • res <- resample(lrn_wrapper, task, resampling = cv2, extract = getTuneResult)
    • generateHyperParsEffectData(res)
    • lrn_best <- setHyperPars(lrn, par.vals = list(minsplit = 4, minbucket = 3, maxdepth = 6))
    • model_best <- train(lrn_best, task)
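The nested cross-validation recipe above can be sketched with a lightweight learner (assumptions: mlr and rpart installed; grid search over rpart's maxdepth replaces GenSA over the h2o search space, and iris replaces the course data):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")
lrn  <- makeLearner("classif.rpart")
param_set <- makeParamSet(makeDiscreteParam("maxdepth", values = 2:5))

# Inner loop: hyperparameter tuning inside each outer training fold
inner <- makeResampleDesc("CV", iters = 2)
lrn_wrapper <- makeTuneWrapper(learner = lrn, resampling = inner,
                               control = makeTuneControlGrid(),
                               par.set = param_set)

# Outer loop: unbiased performance estimate of the whole tuning procedure
cv2 <- makeResampleDesc("CV", iters = 2)
res <- resample(lrn_wrapper, task, resampling = cv2, extract = getTuneResult)
res$extract  # per-fold tuning results
```

The outer-loop performance in res$aggr reflects both model fitting and tuning, which is the point of nesting.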

Example code includes:

vecData <- c(0.08, 0.18, 0.1, 0.12, 0.09, 0.08, 0.2, 0.2, 0.13, 0.18, 0.24, 0.18, 0.31, 0.28, 0.325, 0.323, 0.299, 0.32, 0.329, 0.315, 0.325, 0.325, 0.312, 0.299, 0.48, 0.46, 0.48, 0.49, 0.495, 0.43, 0.4, 0.44, 0.49, 0.44, 0.46, 0.495, 0.49, 0.42, 0.78, 0.85, 0.06, 0.08, 0.2, 0.06, 0.1, 0.15, 0.12, 0.06, 0.15, 0.1, 0.02, 0.09, 0.1, 0.08, 0.09, 0.2, 0.28, 0.265, 0.275, 0.295, 0.32, 0.25, 0.27, 0.27, 0.29, 0.288, 0.255, 0.295, 0.243, 0.295, 0.276, 0.258, 0.28, 0.255, 0.265, 0.255, 0.39, 0.38, 0.37, 0.38, 0.1, 0.1, 0.2, 0.18, 0.1, 0.12, 0.19, 0.14, 0.18, 0.17, 0.1, 0.23, 0.18, 0.2, 0.09, 0.06, 0.15, 0.29, 0.3, 0.27, 0.3, 0.295, 0.29, 0.258, 0.32, 0.3, 0.29, 0.26, 0.305, 0.32, 0.295, 0.285, 0.3, 0.4, 0.4, 0.41, 0.41, 0.44, 0.42, 0.43, 0.08, 0.18, 0.1, 0.12, 0.3, 0.325, 0.45, 0.49, 0.39, 0.34, 0.75, 0.51, 0.1, 0.16, 0.25, 0.32, 0.32, 0.28, 0.55, 0.69, 0.61, 0.9, 0.8, 0.7, 0.12, 0.2, 0.3, 0.245, 0.276, 0.45, 0.33, 0.33, 0.34, 0.55, 0.78, 0.82, 0.9, 0.7, 0.21, 0.05, 0.06, 0.08, 0.14, 0.06, 0.25, 0.32, 0.28, 0.29, 0.295, 0.42, 0.33, 0.55, 0.6, 0.58, 0.61, 0.68, 0.1, 0.06, 0.1, 0.2, 0.12, 0.29, 0.1, 0.31, 0.29, 0.31, 0.305, 0.25, 0.27, 0.29, 0.255, 0.31, 0.65, 0.75, 0.76, 0.72, 0.05, 0.1, 0.06, 0.01, 0.1, 0.1, 0.2, 0.3, 0.27, 0.245, 0.38, 0.49, 0.33, 0.36, 0.39, 0.7, 0.72, 0.52, 0.6, 0.77, 0.79, 0.06, 0.08, 0.12, 0.2, 0.25, 0.3, 0.28, 0.255, 0.27, 0.3, 0.28, 0.255, 0.27, 0.59, 0.64, 0.85, 0.18, 0.12, 0.18, 0.09, 0.08, 0.21, 0.305, 0.1, 0.55, 0.7, 0.75, 0.68, 0.62, 0.28, 0.6, 0.85, 0.71, 0.32, 0.58, 0.41, 0.69, 0.38, 0.89, 0.31, 0.72, 0.02, 0.28, 0.46, 0.52, 0.67, 0.95, 0.28, 0.76, 0.15, 0.38, 0.58, 0.27, 0.12, 0.59, 0.88, 0.11, 0.38, 0.67, 0.52, 0.72, 0.68, 0.91, 0.05, 0.08, 0.35, 0.51, 0.1, 0.05, 0.2, 0.35, 0.75, 0.22, 0.36, 0.12, 0.33, 0.6, 0.53, 0.73, 0.12, 0.57, 0.72, 0.86, 0.79, 0.15, 0.1, 0.32, 0.4, 0.79, 0.86, 0.73, 0.08, 0.31, 0.81, 0.88, 0.4, 0.35, 0.8, 0.72, 0.02, 0.4, 0.32, 0.53, 0.15, 0.52, 0.7, 0.37, 0.31, 0.75, 0.38, 0.55, 0.61, 0.8, 0.75, 0.19, 0.37, 0.36, 
0.66, 0.72, 0.78, 0.19, 0.4, 0.37, 0.52, 0.26, 0.52, 0.64, 0.55, 0.31, 0.56, 0.6, 0.63, 0.52, 0.29, 0.18, 0.54, 0.26, 0.41, 0.33, 0.58, 0.8, 0.87, 0.51, 0.24, 0.3, 0.15, 0.35, 0.18, 0.94, 0.31, 0.2, 0.38, 0.71, 0.18, 0.33, 0.42, 0.33, 0.31, 0.32, 0.33, 0.89, 0.4, 0.8, 0.32, 0.49, 0.92, 0.22, 0.7, 0.95, 0.65, 0.14, 0.77, 0.27, 0.3, 0.53, 0.75, 0.26, 0.24, 0.01, 0.9, 0.3, 0.65, 0.8, 0.25, 0.98, 0.72, 0.41, 0.08, 0.27, 0.78, 0.76, 0.65, 0.72, 0.76, 0.78, 0.42, 0.64, 0.75, 0.48, 0.28, 0.75, 0.1, 0.44, 0.76, 0.48, 0.7, 0.41, 0.78, 0.23, 0.62, 0.77, 0.42, 0.76, 0.27, 0.4, 0.65, 0.72, 0.28, 0.63, 0.06, 0.48, 0.78, 0.27, 0.65, 0.78, 0.3, 0.12, 0.29, 0.31, 0.49, 0.29, 0.64, 0.14, 0.31, 0.51, 0.29, 0.84, 0.19, 0.19, 0.3, 0.55, 0.02, 0.29, 0.3, 0.12, 0.09, 0.29, 0.78, 0.31, 0.25, 0.29, 0.4, 0.81, 0.31, 0.61, 0.25, 0.26, 0.1, 0.31, 0.18, 0.22, 0.56, 0.09, 0.9, 0.81, 0.9, 0.8, 0.85, 0.56, 0.78, 0.78, 0.77, 0.9, 0.86, 0.82, 0.75, 0.78, 0.79, 0.8, 0.87, 0.58, 0.79, 0.7, 0.81, 0.76, 0.5, 0.66, 0.71, 0.65, 0.77, 0.86, 0.83, 0.89, 0.9, 0.85, 0.71, 0.83, 0.89, 0.93, 0.47, 0.8, 0.75, 0.68, 0.33, 0.24, 0.25, 0.3, 0.33, 0.29, 0.2, 0.25, 0.24, 0.26, 0.1, 0.05, 0.26, 0.1, 0.01, 0.28, 0.32, 0.1, 0.3, 0.28, 0.24, 0.26, 0.25, 0.28, 0.18, 0.24, 0.15, 0.19, 0.29, 0.1, 0.33, 0.3, 0.13, 0.25, 0.28, 0.14, 0.34, 0.26, 0.1, 0.3, 0.3, 0.34, 0.6, 0.66, 0.65, 0.59, 0.45, 0.6, 0.25, 0.66, 0.62, 0.45, 0.55, 0.25, 0.59, 0.56, 0.51, 0.51, 0.67, 0.58, 0.53, 0.67, 0.67, 0.56, 0.34, 0.54, 0.67, 0.59, 0.54, 0.3, 0.55, 0.45, 0.83, 0.67, 0.65, 0.5, 0.58, 0.56, 0.48, 0.64)

knowledge_train_data <- tibble(UNS=rep(c("High", "Low", "Medium"), each=40))
mtxData <- data.frame(matrix(vecData, nrow=120, byrow=FALSE))
names(mtxData) <- c("STG", "SCG", "STR", "LPR", "PEG")
knowledge_train_data <- bind_cols(as_tibble(mtxData), knowledge_train_data)
glimpse(knowledge_train_data)
## Observations: 120
## Variables: 6
## $ STG <dbl> 0.080, 0.180, 0.100, 0.120, 0.090, 0.080, 0.200, 0.200, 0....
## $ SCG <dbl> 0.080, 0.180, 0.100, 0.120, 0.300, 0.325, 0.450, 0.490, 0....
## $ STR <dbl> 0.10, 0.55, 0.70, 0.75, 0.68, 0.62, 0.28, 0.60, 0.85, 0.71...
## $ LPR <dbl> 0.24, 0.30, 0.15, 0.35, 0.18, 0.94, 0.31, 0.20, 0.38, 0.71...
## $ PEG <dbl> 0.90, 0.81, 0.90, 0.80, 0.85, 0.56, 0.78, 0.78, 0.77, 0.90...
## $ UNS <chr> "High", "High", "High", "High", "High", "High", "High", "H...
library(mlr)
## Loading required package: ParamHelpers
## 
## Attaching package: 'ParamHelpers'
## The following object is masked from 'package:pkgmaker':
## 
##     isInteger
## 
## Attaching package: 'mlr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     rmse
## The following object is masked from 'package:caret':
## 
##     train
## The following object is masked from 'package:e1071':
## 
##     impute
## The following object is masked from 'package:processmapR':
## 
##     performance
# Create classification task
task <- mlr::makeClassifTask(data = knowledge_train_data, target = "UNS")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class tbl_df,
## hence it will be converted.
# Call the list of learners
mlr::listLearners() %>%
    as.data.frame() %>%
    select(class, short.name, package) %>%
    filter(grepl("classif.", class))
## Warning in listLearners.character(obj = NA_character_, properties, quiet, : The following learners could not be constructed, probably because their packages are not installed:
## classif.ada,classif.bartMachine,classif.boosting,classif.bst,classif.clusterSVM,classif.dbnDNN,classif.dcSVM,classif.earth,classif.evtree,classif.extraTrees,classif.fdausc.glm,classif.fdausc.kernel,classif.fdausc.knn,classif.fdausc.np,classif.gamboost,classif.gaterSVM,classif.geoDA,classif.glmboost,classif.kknn,classif.LiblineaRL1L2SVC,classif.LiblineaRL1LogReg,classif.LiblineaRL2L1SVC,classif.LiblineaRL2LogReg,classif.LiblineaRL2SVC,classif.LiblineaRMultiClassSVC,classif.linDA,classif.liquidSVM,classif.mda,classif.mlp,classif.neuralnet,classif.nnTrain,classif.nodeHarvest,classif.pamr,classif.penalized,classif.plr,classif.quaDA,classif.randomForestSRC,classif.rFerns,classif.rknn,classif.rotationForest,classif.RRF,classif.rrlda,classif.saeDNN,classif.sda,classif.sparseLDA,cluster.cmeans,cluster.kmeans,multilabel.randomForestSRC,multilabel.rFerns,regr.bartMachine,regr.bcart,regr.bgp,regr.bgpllm,regr.blm,regr.brnn,regr.bst,regr.btgp,regr.btgpllm,regr.btlm,regr.crs,regr.earth,regr.evtree,regr.extraTrees,regr.FDboost,regr.frbs,regr.gamboost,regr.glmboost,regr.GPfit,regr.kknn,regr.km,regr.laGP,regr.LiblineaRL2L1SVR,regr.LiblineaRL2L2SVR,regr.liquidSVM,regr.mars,regr.nodeHarvest,regr.penalized,regr.randomForestSRC,regr.rknn,regr.RRF,regr.rsm,regr.slim,surv.CoxBoost,surv.cv.CoxBoost,surv.gamboost,surv.glmboost,surv.randomForestSRC
## Check ?learners to see which packages you need or install mlr with all suggestions.
##                             class          short.name
## 1                     classif.ada                 ada
## 2              classif.adaboostm1          adaboostm1
## 3             classif.bartMachine         bartmachine
## 4                classif.binomial            binomial
## 5                classif.boosting              adabag
## 6                     classif.bst                 bst
## 7                     classif.C50                 C50
## 8                 classif.cforest             cforest
## 9              classif.clusterSVM          clusterSVM
## 10                  classif.ctree               ctree
## 11               classif.cvglmnet            cvglmnet
## 12                 classif.dbnDNN             dbn.dnn
## 13                  classif.dcSVM               dcSVM
## 14                  classif.earth                 fda
## 15                 classif.evtree              evtree
## 16             classif.extraTrees          extraTrees
## 17             classif.fdausc.glm          fdausc.glm
## 18          classif.fdausc.kernel       fdausc.kernel
## 19             classif.fdausc.knn          fdausc.knn
## 20              classif.fdausc.np           fdausc.np
## 21            classif.featureless         featureless
## 22                    classif.fnn                 fnn
## 23               classif.gamboost            gamboost
## 24               classif.gaterSVM            gaterSVM
## 25                classif.gausspr             gausspr
## 26                    classif.gbm                 gbm
## 27                  classif.geoDA               geoda
## 28               classif.glmboost            glmboost
## 29                 classif.glmnet              glmnet
## 30       classif.h2o.deeplearning              h2o.dl
## 31                classif.h2o.gbm             h2o.gbm
## 32                classif.h2o.glm             h2o.glm
## 33       classif.h2o.randomForest              h2o.rf
## 34                    classif.IBk                 ibk
## 35                    classif.J48                 j48
## 36                   classif.JRip                jrip
## 37                   classif.kknn                kknn
## 38                    classif.knn                 knn
## 39                   classif.ksvm                ksvm
## 40                    classif.lda                 lda
## 41       classif.LiblineaRL1L2SVC       liblinl1l2svc
## 42      classif.LiblineaRL1LogReg      liblinl1logreg
## 43       classif.LiblineaRL2L1SVC       liblinl2l1svc
## 44      classif.LiblineaRL2LogReg      liblinl2logreg
## 45         classif.LiblineaRL2SVC         liblinl2svc
## 46 classif.LiblineaRMultiClassSVC liblinmulticlasssvc
## 47                  classif.linDA               linda
## 48              classif.liquidSVM           liquidSVM
## 49                 classif.logreg              logreg
## 50                  classif.lssvm               lssvm
## 51                   classif.lvq1                lvq1
## 52                    classif.mda                 mda
## 53                    classif.mlp                 mlp
## 54               classif.multinom            multinom
## 55             classif.naiveBayes              nbayes
## 56              classif.neuralnet           neuralnet
## 57                   classif.nnet                nnet
## 58                classif.nnTrain            nn.train
## 59            classif.nodeHarvest         nodeHarvest
## 60                   classif.OneR                oner
## 61                   classif.pamr                pamr
## 62                   classif.PART                part
## 63              classif.penalized           penalized
## 64                    classif.plr                 plr
## 65             classif.plsdaCaret          plsdacaret
## 66                 classif.probit              probit
## 67                    classif.qda                 qda
## 68                  classif.quaDA               quada
## 69           classif.randomForest                  rf
## 70        classif.randomForestSRC               rfsrc
## 71                 classif.ranger              ranger
## 72                    classif.rda                 rda
## 73                 classif.rFerns              rFerns
## 74                   classif.rknn                rknn
## 75         classif.rotationForest      rotationForest
## 76                  classif.rpart               rpart
## 77                    classif.RRF                 RRF
## 78                  classif.rrlda               rrlda
## 79                 classif.saeDNN             sae.dnn
## 80                    classif.sda                 sda
## 81              classif.sparseLDA           sparseLDA
## 82                    classif.svm                 svm
## 83                classif.xgboost             xgboost
##                      package
## 1                  ada,rpart
## 2                      RWeka
## 3                bartMachine
## 4                      stats
## 5               adabag,rpart
## 6                  bst,rpart
## 7                        C50
## 8                      party
## 9         SwarmSVM,LiblineaR
## 10                     party
## 11                    glmnet
## 12                   deepnet
## 13            SwarmSVM,e1071
## 14               earth,stats
## 15                    evtree
## 16                extraTrees
## 17                   fda.usc
## 18                   fda.usc
## 19                   fda.usc
## 20                   fda.usc
## 21                       mlr
## 22                       FNN
## 23                    mboost
## 24                  SwarmSVM
## 25                   kernlab
## 26                       gbm
## 27               DiscriMiner
## 28                    mboost
## 29                    glmnet
## 30                       h2o
## 31                       h2o
## 32                       h2o
## 33                       h2o
## 34                     RWeka
## 35                     RWeka
## 36                     RWeka
## 37                      kknn
## 38                     class
## 39                   kernlab
## 40                      MASS
## 41                 LiblineaR
## 42                 LiblineaR
## 43                 LiblineaR
## 44                 LiblineaR
## 45                 LiblineaR
## 46                 LiblineaR
## 47               DiscriMiner
## 48                 liquidSVM
## 49                     stats
## 50                   kernlab
## 51                     class
## 52                       mda
## 53                     RSNNS
## 54                      nnet
## 55                     e1071
## 56                 neuralnet
## 57                      nnet
## 58                   deepnet
## 59               nodeHarvest
## 60                     RWeka
## 61                      pamr
## 62                     RWeka
## 63                 penalized
## 64                   stepPlr
## 65                 caret,pls
## 66                     stats
## 67                      MASS
## 68               DiscriMiner
## 69              randomForest
## 70           randomForestSRC
## 71                    ranger
## 72                      klaR
## 73                    rFerns
## 74                      rknn
## 75            rotationForest
## 76                     rpart
## 77                       RRF
## 78                     rrlda
## 79                   deepnet
## 80                       sda
## 81 sparseLDA,MASS,elasticnet
## 82                     e1071
## 83                   xgboost
# Create learner
lrn <- mlr::makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)

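With a task and learner defined, fitting and prediction follow the usual mlr pattern. A minimal sketch (it assumes the `task` and `lrn` objects above; `knowledge_test_data` is a hypothetical held-out set in the same format as the training data):

```r
# Train the random-forest learner on the classification task
model <- mlr::train(lrn, task)

# Predict on new data; with predict.type = "prob" the prediction
# carries class probabilities as well as the predicted label
pred <- predict(model, newdata = knowledge_test_data)

# Evaluate with mean misclassification error
mlr::performance(pred, measures = mlr::mmce)
```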

# Get the parameter set for neural networks of the nnet package
ParamHelpers::getParamSet("classif.nnet")
##            Type len    Def      Constr Req Tunable Trafo
## size    integer   -      3    0 to Inf   -    TRUE     -
## maxit   integer   -    100    1 to Inf   -    TRUE     -
## skip    logical   -  FALSE           -   -    TRUE     -
## rang    numeric   -    0.7 -Inf to Inf   -    TRUE     -
## decay   numeric   -      0 -Inf to Inf   -    TRUE     -
## Hess    logical   -  FALSE           -   -    TRUE     -
## trace   logical   -   TRUE           -   -   FALSE     -
## MaxNWts integer   -   1000    1 to Inf   -   FALSE     -
## abstol  numeric   - 0.0001 -Inf to Inf   -    TRUE     -
## reltol  numeric   -  1e-08 -Inf to Inf   -    TRUE     -
# Define set of parameters
param_set <- ParamHelpers::makeParamSet(ParamHelpers::makeDiscreteParam("size", values = c(2,3,5)),
                                        ParamHelpers::makeNumericParam("decay", lower = 0.0001, upper = 0.1)
                                        )

# Print parameter set
print(param_set)
##           Type len Def        Constr Req Tunable Trafo
## size  discrete   -   -         2,3,5   -    TRUE     -
## decay  numeric   -   - 0.0001 to 0.1   -    TRUE     -
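Random search draws candidate configurations uniformly from this parameter set. ParamHelpers can show what such draws look like (a sketch using the `param_set` defined above; the sampled values vary from run to run):

```r
# Draw three random configurations from the parameter set:
# each row is one (size, decay) candidate a random search would try
ParamHelpers::generateRandomDesign(3, param_set)
```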
# Define a random search tuning method.
ctrl_random <- mlr::makeTuneControlRandom()

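The exhaustive alternative to random search is grid search, which evaluates every combination of the discrete values crossed with an evenly spaced grid over the numeric ranges. mlr exposes it through the same control interface:

```r
# Grid search control: 'resolution' sets how finely numeric
# parameters (here, decay) are discretized along their range
ctrl_grid <- mlr::makeTuneControlGrid(resolution = 5L)
```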

# Define task
task <- mlr::makeClassifTask(data = knowledge_train_data, target = "UNS")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class tbl_df,
## hence it will be converted.
# Define learner (nnet prints an optimization trace by default,
# as seen in the output below; pass trace = FALSE to silence it)
lrn <- mlr::makeLearner("classif.nnet", predict.type = "prob", fix.factors.prediction = TRUE)

# Define set of parameters
param_set <- ParamHelpers::makeParamSet(ParamHelpers::makeDiscreteParam("size", values = c(2,3,5)),
                                        ParamHelpers::makeNumericParam("decay", lower = 0.0001, upper = 0.1)
                                        )

# Define a random search tuning method.
ctrl_random <- mlr::makeTuneControlRandom(maxit = 6)

# Define a repeated cross-validation scheme.
# Note: makeResampleDesc("RepCV") takes separate folds and reps arguments,
# so folds = 2 * 2 requests 4 folds with the default 10 repeats rather
# than a 2 x 2 scheme; for a true 2 x 2 use folds = 2, reps = 2.
cross_val <- mlr::makeResampleDesc("RepCV", folds = 2 * 2)

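Other resampling descriptions follow the same pattern; two common ones as a quick sketch:

```r
# 5-fold cross-validation
cv5 <- mlr::makeResampleDesc("CV", iters = 5L)

# Simple holdout split (2/3 of the data for training by default)
holdout <- mlr::makeResampleDesc("Holdout")
```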
# Tune hyperparameters
tictoc::tic()
lrn_tune <- mlr::tuneParams(lrn, task, resampling = cross_val, control = ctrl_random, par.set=param_set)
## [Tune] Started tuning learner classif.nnet for parameter set:
##           Type len Def        Constr Req Tunable Trafo
## size  discrete   -   -         2,3,5   -    TRUE     -
## decay  numeric   -   - 0.0001 to 0.1   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## [Tune-x] 1: size=2; decay=0.00779
## # weights:  21
## initial  value 112.737276 
## iter  10 value 59.082328
## iter  20 value 19.787952
## iter  30 value 18.064517
## iter  40 value 17.952929
## iter  50 value 17.884020
## iter  60 value 17.869365
## iter  70 value 17.868531
## final  value 17.868519 
## converged
## ... (nnet training traces for the remaining resampling runs omitted;
## each run either converged or stopped at the 100-iteration cap)
## [Tune-y] 1: mmce.test.mean=0.0683333; time: 0.0 min
## [Tune-x] 2: size=2; decay=0.0999
## # weights:  21
## initial  value 105.511103 
## iter  10 value 55.189299
## iter  20 value 50.721806
## iter  30 value 47.445012
## iter  40 value 47.354530
## iter  50 value 47.347136
## final  value 47.347133 
## converged
## ... (nnet training traces for the remaining resampling runs omitted)
## [Tune-y] 2: mmce.test.mean=0.0558333; time: 0.0 min
## [Tune-x] 3: size=3; decay=0.0503
## ... (nnet per-fold iteration trace omitted; most folds converged, a few stopped at the 100-iteration cap)
## [Tune-y] 3: mmce.test.mean=0.0516667; time: 0.0 min
## [Tune-x] 4: size=2; decay=0.0513
## ... (nnet per-fold iteration trace omitted; all folds converged)
## [Tune-y] 4: mmce.test.mean=0.0508333; time: 0.0 min
## [Tune-x] 5: size=2; decay=0.0529
## ... (nnet per-fold iteration trace omitted; all folds converged)
## [Tune-y] 5: mmce.test.mean=0.0483333; time: 0.0 min
## [Tune-x] 6: size=2; decay=0.0489
## # weights:  21
## initial  value 117.861190 
## iter  10 value 90.250219
## iter  20 value 40.742923
## iter  30 value 39.022962
## iter  40 value 35.361635
## iter  50 value 34.996072
## iter  60 value 34.937848
## iter  70 value 34.935984
## iter  70 value 34.935984
## iter  70 value 34.935984
## final  value 34.935984 
## converged
## # weights:  21
## initial  value 106.029194 
## iter  10 value 48.466427
## iter  20 value 39.708246
## iter  30 value 38.490421
## iter  40 value 36.670063
## iter  50 value 35.307432
## iter  60 value 35.285534
## final  value 35.285531 
## converged
## # weights:  21
## initial  value 106.849317 
## iter  10 value 43.415825
## iter  20 value 32.869017
## iter  30 value 31.624670
## iter  40 value 31.561002
## final  value 31.560960 
## converged
## # weights:  21
## initial  value 105.587009 
## iter  10 value 51.691826
## iter  20 value 36.692726
## iter  30 value 35.524574
## iter  40 value 35.465896
## iter  50 value 35.465345
## final  value 35.465338 
## converged
## # weights:  21
## initial  value 120.143330 
## iter  10 value 72.997709
## iter  20 value 37.570891
## iter  30 value 35.107413
## iter  40 value 33.545031
## iter  50 value 32.163737
## iter  60 value 32.059405
## final  value 32.059394 
## converged
## # weights:  21
## initial  value 100.603201 
## iter  10 value 41.441109
## iter  20 value 35.366440
## iter  30 value 34.112787
## iter  40 value 34.095898
## iter  50 value 34.093070
## final  value 34.093068 
## converged
## # weights:  21
## initial  value 106.297463 
## iter  10 value 53.175482
## iter  20 value 36.927581
## iter  30 value 35.347907
## iter  40 value 35.036263
## final  value 34.963294 
## converged
## # weights:  21
## initial  value 105.619666 
## iter  10 value 58.745873
## iter  20 value 39.208407
## iter  30 value 36.240766
## iter  40 value 35.738948
## iter  50 value 35.725860
## iter  60 value 35.724692
## iter  60 value 35.724691
## iter  60 value 35.724691
## final  value 35.724691 
## converged
## # weights:  21
## initial  value 105.583887 
## iter  10 value 65.491055
## iter  20 value 36.513906
## iter  30 value 36.157907
## iter  40 value 36.135490
## iter  50 value 36.131031
## iter  60 value 36.130601
## final  value 36.130597 
## converged
## # weights:  21
## initial  value 113.978501 
## iter  10 value 52.173477
## iter  20 value 41.085604
## iter  30 value 36.491546
## iter  40 value 36.011331
## iter  50 value 36.004845
## final  value 36.004586 
## converged
## # weights:  21
## initial  value 105.133454 
## iter  10 value 60.851301
## iter  20 value 31.329019
## iter  30 value 31.009777
## iter  40 value 30.984210
## final  value 30.982406 
## converged
## # weights:  21
## initial  value 106.497975 
## iter  10 value 60.249416
## iter  20 value 32.296689
## iter  30 value 31.747834
## iter  40 value 31.610813
## iter  50 value 31.593404
## final  value 31.593381 
## converged
## # weights:  21
## initial  value 101.208259 
## iter  10 value 67.447495
## iter  20 value 37.185321
## iter  30 value 35.425818
## iter  40 value 35.404260
## final  value 35.403925 
## converged
## # weights:  21
## initial  value 100.451591 
## iter  10 value 48.968317
## iter  20 value 36.763274
## iter  30 value 34.622080
## iter  40 value 34.349873
## iter  50 value 34.309222
## final  value 34.309144 
## converged
## # weights:  21
## initial  value 100.650602 
## iter  10 value 52.740871
## iter  20 value 42.093963
## iter  30 value 32.946364
## iter  40 value 32.285787
## iter  50 value 32.155022
## iter  60 value 32.151185
## final  value 32.151162 
## converged
## # weights:  21
## initial  value 98.768157 
## iter  10 value 48.385739
## iter  20 value 33.368215
## iter  30 value 32.798120
## iter  40 value 32.756773
## final  value 32.756531 
## converged
## # weights:  21
## initial  value 101.478353 
## iter  10 value 52.312703
## iter  20 value 38.295034
## iter  30 value 34.842349
## iter  40 value 33.820825
## iter  50 value 33.782484
## final  value 33.782128 
## converged
## # weights:  21
## initial  value 111.550391 
## iter  10 value 65.220065
## iter  20 value 42.870568
## iter  30 value 37.915712
## iter  40 value 36.829917
## iter  50 value 36.627020
## iter  60 value 36.608771
## final  value 36.608770 
## converged
## # weights:  21
## initial  value 107.525522 
## iter  10 value 46.562441
## iter  20 value 33.120633
## iter  30 value 32.376418
## iter  40 value 32.131337
## iter  50 value 32.125770
## iter  60 value 32.125726
## final  value 32.125725 
## converged
## # weights:  21
## initial  value 102.744198 
## iter  10 value 66.948651
## iter  20 value 40.331163
## iter  30 value 38.410295
## iter  40 value 36.310764
## iter  50 value 35.528870
## iter  60 value 35.431026
## final  value 35.430971 
## converged
## # weights:  21
## initial  value 101.111982 
## iter  10 value 56.462157
## iter  20 value 35.550679
## iter  30 value 31.062562
## iter  40 value 29.885778
## iter  50 value 29.628286
## iter  60 value 29.624819
## iter  60 value 29.624818
## iter  60 value 29.624818
## final  value 29.624818 
## converged
## # weights:  21
## initial  value 100.897310 
## iter  10 value 48.549823
## iter  20 value 33.375943
## iter  30 value 33.222116
## iter  40 value 33.220556
## final  value 33.220540 
## converged
## # weights:  21
## initial  value 100.757238 
## iter  10 value 75.770837
## iter  20 value 38.507960
## iter  30 value 37.909705
## iter  40 value 37.734341
## final  value 37.732562 
## converged
## # weights:  21
## initial  value 105.558485 
## iter  10 value 71.933908
## iter  20 value 40.920310
## iter  30 value 40.108249
## iter  40 value 37.587752
## iter  50 value 35.568456
## iter  60 value 34.891830
## final  value 34.891419 
## converged
## # weights:  21
## initial  value 106.674563 
## iter  10 value 42.910153
## iter  20 value 34.317607
## iter  30 value 33.809851
## iter  40 value 33.731234
## iter  50 value 33.716329
## final  value 33.716327 
## converged
## # weights:  21
## initial  value 104.522037 
## iter  10 value 51.747258
## iter  20 value 39.299911
## iter  30 value 35.774372
## iter  40 value 34.593344
## iter  50 value 34.231249
## iter  60 value 34.188912
## final  value 34.188907 
## converged
## # weights:  21
## initial  value 118.934604 
## iter  10 value 81.500941
## iter  20 value 35.410736
## iter  30 value 35.066602
## iter  40 value 34.938320
## iter  50 value 34.919249
## iter  60 value 34.918731
## iter  60 value 34.918731
## iter  60 value 34.918731
## final  value 34.918731 
## converged
## # weights:  21
## initial  value 102.416168 
## iter  10 value 56.403492
## iter  20 value 35.603010
## iter  30 value 32.772077
## iter  40 value 32.624629
## final  value 32.623873 
## converged
## # weights:  21
## initial  value 100.447528 
## iter  10 value 46.796222
## iter  20 value 39.865526
## iter  30 value 37.788226
## iter  40 value 37.473390
## iter  50 value 37.428652
## final  value 37.427723 
## converged
## # weights:  21
## initial  value 101.723900 
## iter  10 value 38.734327
## iter  20 value 33.814560
## iter  30 value 33.638394
## iter  40 value 33.516836
## iter  50 value 33.456297
## iter  60 value 33.455376
## final  value 33.455359 
## converged
## # weights:  21
## initial  value 99.974782 
## iter  10 value 57.879650
## iter  20 value 38.740902
## iter  30 value 36.587949
## iter  40 value 36.042494
## iter  50 value 36.033150
## final  value 36.033044 
## converged
## # weights:  21
## initial  value 103.013515 
## iter  10 value 59.086022
## iter  20 value 36.096015
## iter  30 value 31.469181
## iter  40 value 30.486229
## iter  50 value 30.320742
## final  value 30.320578 
## converged
## # weights:  21
## initial  value 103.743890 
## iter  10 value 47.326807
## iter  20 value 40.628897
## iter  30 value 38.043742
## iter  40 value 37.689187
## iter  50 value 37.564948
## iter  60 value 37.556020
## final  value 37.556016 
## converged
## # weights:  21
## initial  value 101.430802 
## iter  10 value 54.959616
## iter  20 value 39.914886
## iter  30 value 37.570304
## iter  40 value 35.858085
## iter  50 value 35.313224
## iter  60 value 35.236166
## final  value 35.236147 
## converged
## # weights:  21
## initial  value 107.061360 
## iter  10 value 81.303849
## iter  20 value 34.618984
## iter  30 value 31.533948
## iter  40 value 31.404199
## iter  50 value 31.401084
## final  value 31.400995 
## converged
## # weights:  21
## initial  value 110.955782 
## iter  10 value 71.959721
## iter  20 value 33.737680
## iter  30 value 32.875519
## iter  40 value 32.804143
## iter  50 value 32.793238
## iter  50 value 32.793238
## final  value 32.793238 
## converged
## # weights:  21
## initial  value 103.713361 
## iter  10 value 58.606061
## iter  20 value 37.595633
## iter  30 value 32.990052
## iter  40 value 31.978047
## iter  50 value 31.890762
## iter  60 value 31.888101
## final  value 31.888085 
## converged
## # weights:  21
## initial  value 102.503762 
## iter  10 value 56.618728
## iter  20 value 38.240872
## iter  30 value 33.845350
## iter  40 value 33.110492
## iter  50 value 32.848715
## iter  60 value 32.835273
## iter  60 value 32.835273
## iter  60 value 32.835273
## final  value 32.835273 
## converged
## # weights:  21
## initial  value 103.516994 
## iter  10 value 64.022834
## iter  20 value 43.091371
## iter  30 value 40.513197
## iter  40 value 34.595830
## iter  50 value 33.872557
## iter  60 value 33.845544
## final  value 33.845540 
## converged
## # weights:  21
## initial  value 119.125959 
## iter  10 value 72.927327
## iter  20 value 38.987999
## iter  30 value 36.871837
## iter  40 value 36.789092
## iter  50 value 36.778489
## final  value 36.777773 
## converged
## [Tune-y] 6: mmce.test.mean=0.0516667; time: 0.0 min
## [Tune] Result: size=2; decay=0.0529 : mmce.test.mean=0.0483333
tictoc::toc()
## 4.21 sec elapsed
task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class tbl_df,
## hence it will be converted.
lrn <- makeLearner(cl = "classif.rpart", fix.factors.prediction = TRUE)
param_set <- makeParamSet(makeIntegerParam("minsplit", lower = 1, upper = 30),
                          makeIntegerParam("minbucket", lower = 1, upper = 30),
                          makeIntegerParam("maxdepth", lower = 3, upper = 10)
                          )
ctrl_random <- makeTuneControlRandom(maxit = 10)


# Create holdout sampling
holdout <- makeResampleDesc("Holdout")

# Perform tuning
lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, 
                       control = ctrl_random, par.set = param_set
                       )
## [Tune] Started tuning learner classif.rpart for parameter set:
##              Type len Def  Constr Req Tunable Trafo
## minsplit  integer   -   - 1 to 30   -    TRUE     -
## minbucket integer   -   - 1 to 30   -    TRUE     -
## maxdepth  integer   -   - 3 to 10   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## [Tune-x] 1: minsplit=5; minbucket=23; maxdepth=7
## [Tune-y] 1: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune-x] 2: minsplit=20; minbucket=29; maxdepth=6
## [Tune-y] 2: mmce.test.mean=0.4500000; time: 0.0 min
## [Tune-x] 3: minsplit=6; minbucket=29; maxdepth=9
## [Tune-y] 3: mmce.test.mean=0.4500000; time: 0.0 min
## [Tune-x] 4: minsplit=29; minbucket=17; maxdepth=7
## [Tune-y] 4: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune-x] 5: minsplit=20; minbucket=1; maxdepth=8
## [Tune-y] 5: mmce.test.mean=0.0750000; time: 0.0 min
## [Tune-x] 6: minsplit=15; minbucket=16; maxdepth=10
## [Tune-y] 6: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune-x] 7: minsplit=11; minbucket=5; maxdepth=3
## [Tune-y] 7: mmce.test.mean=0.0750000; time: 0.0 min
## [Tune-x] 8: minsplit=10; minbucket=9; maxdepth=8
## [Tune-y] 8: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune-x] 9: minsplit=21; minbucket=11; maxdepth=7
## [Tune-y] 9: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune-x] 10: minsplit=19; minbucket=14; maxdepth=9
## [Tune-y] 10: mmce.test.mean=0.1000000; time: 0.0 min
## [Tune] Result: minsplit=11; minbucket=5; maxdepth=3 : mmce.test.mean=0.0750000
# Generate hyperparameter effect data
hyperpar_effects <- generateHyperParsEffectData(lrn_tune, partial.dep = TRUE)

# Plot hyperparameter effects
plotHyperParsEffect(hyperpar_effects, partial.dep.learn = "regr.glm",
                    x = "minsplit", y = "mmce.test.mean", z = "maxdepth", plot.type = "line"
                    )
## Loading required package: mmpf

task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class tbl_df,
## hence it will be converted.
lrn <- makeLearner(cl = "classif.nnet", fix.factors.prediction = TRUE)
param_set <- makeParamSet(makeIntegerParam("size", lower = 1, upper = 5),
                          makeIntegerParam("maxit", lower = 1, upper = 300),
                          makeNumericParam("decay", lower = 0.0001, upper = 1)
                          )
ctrl_random <- makeTuneControlRandom(maxit = 10)


# Create holdout sampling
holdout <- makeResampleDesc("Holdout", predict = "both")

# Perform tuning
lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, control = ctrl_random, 
                       par.set = param_set,
                       measures = list(mmce, setAggregation(mmce, train.sd), 
                                       acc, setAggregation(acc, train.sd)
                                       )
                       )
## [Tune] Started tuning learner classif.nnet for parameter set:
##          Type len Def      Constr Req Tunable Trafo
## size  integer   -   -      1 to 5   -    TRUE     -
## maxit integer   -   -    1 to 300   -    TRUE     -
## decay numeric   -   - 0.0001 to 1   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1Imputation value: InfImputation value: -0Imputation value: Inf
## [Tune-x] 1: size=3; maxit=291; decay=0.337
## # weights:  30
## initial  value 90.017712 
## iter  10 value 69.909105
## iter  20 value 68.519082
## iter  30 value 68.489080
## final  value 68.488869 
## converged
## [Tune-y] 1: mmce.test.mean=0.0750000,mmce.train.sd=      NA,acc.test.mean=0.9250000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 2: size=4; maxit=297; decay=0.926
## # weights:  39
## initial  value 102.874652 
## iter  10 value 86.129571
## iter  20 value 85.812356
## iter  30 value 85.587055
## iter  40 value 85.492677
## final  value 85.492517 
## converged
## [Tune-y] 2: mmce.test.mean=0.3000000,mmce.train.sd=      NA,acc.test.mean=0.7000000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 3: size=3; maxit=107; decay=0.364
## # weights:  30
## initial  value 90.002254 
## iter  10 value 73.795890
## iter  20 value 70.600815
## iter  30 value 70.471580
## iter  40 value 70.457018
## final  value 70.456414 
## converged
## [Tune-y] 3: mmce.test.mean=0.0750000,mmce.train.sd=      NA,acc.test.mean=0.9250000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 4: size=1; maxit=98; decay=0.888
## # weights:  12
## initial  value 97.025873 
## iter  10 value 86.577790
## final  value 86.494148 
## converged
## [Tune-y] 4: mmce.test.mean=0.3000000,mmce.train.sd=      NA,acc.test.mean=0.7000000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 5: size=3; maxit=214; decay=0.438
## # weights:  30
## initial  value 110.918770 
## iter  10 value 81.202165
## iter  20 value 75.915193
## iter  30 value 74.361787
## iter  40 value 74.182732
## iter  50 value 74.168126
## final  value 74.168124 
## converged
## [Tune-y] 5: mmce.test.mean=0.1250000,mmce.train.sd=      NA,acc.test.mean=0.8750000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 6: size=2; maxit=152; decay=0.0457
## # weights:  21
## initial  value 89.898037 
## iter  10 value 38.842565
## iter  20 value 29.586591
## iter  30 value 29.071818
## iter  40 value 28.994512
## iter  50 value 28.991637
## final  value 28.991635 
## converged
## [Tune-y] 6: mmce.test.mean=0.0500000,mmce.train.sd=      NA,acc.test.mean=0.9500000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 7: size=2; maxit=192; decay=0.339
## # weights:  21
## initial  value 93.481108 
## iter  10 value 73.030253
## iter  20 value 69.268595
## final  value 69.235349 
## converged
## [Tune-y] 7: mmce.test.mean=0.0750000,mmce.train.sd=      NA,acc.test.mean=0.9250000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 8: size=1; maxit=126; decay=0.082
## # weights:  12
## initial  value 98.881194 
## iter  10 value 71.943449
## iter  20 value 45.171255
## iter  30 value 45.096613
## final  value 45.096453 
## converged
## [Tune-y] 8: mmce.test.mean=0.0500000,mmce.train.sd=      NA,acc.test.mean=0.9500000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 9: size=5; maxit=133; decay=0.804
## # weights:  48
## initial  value 111.225861 
## iter  10 value 84.368740
## iter  20 value 83.610365
## iter  30 value 83.570240
## final  value 83.570131 
## converged
## [Tune-y] 9: mmce.test.mean=0.3000000,mmce.train.sd=      NA,acc.test.mean=0.7000000,acc.train.sd=      NA; time: 0.0 min
## [Tune-x] 10: size=5; maxit=222; decay=0.87
## # weights:  48
## initial  value 126.396017 
## iter  10 value 85.692185
## iter  20 value 84.765681
## iter  30 value 84.749695
## final  value 84.749682 
## converged
## [Tune-y] 10: mmce.test.mean=0.3000000,mmce.train.sd=      NA,acc.test.mean=0.7000000,acc.train.sd=      NA; time: 0.0 min
## [Tune] Result: size=2; maxit=152; decay=0.0457 : mmce.test.mean=0.0500000,mmce.train.sd=      NA,acc.test.mean=0.9500000,acc.train.sd=      NA
task <- makeClassifTask(data = knowledge_train_data, target = "UNS")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class tbl_df,
## hence it will be converted.
lrn <- makeLearner(cl = "classif.nnet", fix.factors.prediction = TRUE)


# Set hyperparameters
lrn_best <- setHyperPars(lrn, par.vals = list(size=1, maxit = 150, decay = 0))

# Train model
model_best <- train(lrn_best, task)
## # weights:  12
## initial  value 132.424063 
## iter  10 value 73.390690
## iter  20 value 23.114640
## iter  30 value 13.856769
## iter  40 value 13.691907
## iter  50 value 13.246583
## iter  60 value 13.245951
## iter  70 value 13.236257
## iter  80 value 13.106557
## iter  90 value 13.105081
## iter 100 value 13.104867
## iter 110 value 13.099957
## iter 120 value 13.099780
## iter 130 value 13.095653
## iter 140 value 13.094127
## iter 150 value 13.094025
## final  value 13.094025 
## stopped after 150 iterations

Chapter 4 - Hyperparameter Tuning with h2o

Machine Learning with h2o:

  • The h2o package is an open-source machine learning package designed for scalability (it runs on distributed clusters)
  • Need to start by initiating a cluster
    • library(h2o)
    • h2o.init()
  • Need to then convert the data to an h2o file and define the feature and target names
    • glimpse(seeds_data)
    • seeds_data_hf <- as.h2o(seeds_data)
    • y <- "seed_type"
    • x <- setdiff(colnames(seeds_data_hf), y)
    • seeds_data_hf[, y] <- as.factor(seeds_data_hf[, y])
  • Can also create train, test, and validation sets
    • sframe <- h2o.splitFrame(data = seeds_data_hf, ratios = c(0.7, 0.15), seed = 42)
    • train <- sframe[[1]]
    • valid <- sframe[[2]]
    • test <- sframe[[3]]
    • summary(train$seed_type, exact_quantiles = TRUE)
  • There are many model algorithms available in h2o
    • Gradient Boosted models with h2o.gbm() & h2o.xgboost()
    • Generalized linear models with h2o.glm()
    • Random Forest models with h2o.randomForest()
    • Neural Networks with h2o.deeplearning()
  • Example of running GBM using h2o
    • gbm_model <- h2o.gbm(x = x, y = y, training_frame = train, validation_frame = valid)
    • perf <- h2o.performance(gbm_model, test)
    • h2o.confusionMatrix(perf)
    • h2o.logloss(perf)
    • h2o.predict(gbm_model, test)
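The GBM bullets above can be assembled into one runnable sketch. This assumes h2o is installed and that a data frame seeds_data with a seed_type target column exists, as in the course:

```r
library(h2o)
h2o.init()

# Convert to an h2o frame and identify target / features
seeds_data_hf <- as.h2o(seeds_data)
y <- "seed_type"
x <- setdiff(colnames(seeds_data_hf), y)
seeds_data_hf[, y] <- as.factor(seeds_data_hf[, y])

# 70% train / 15% validation / remaining 15% test
sframe <- h2o.splitFrame(data = seeds_data_hf, ratios = c(0.7, 0.15), seed = 42)
train <- sframe[[1]]
valid <- sframe[[2]]
test  <- sframe[[3]]

# Fit a GBM and evaluate it on the held-out test frame
gbm_model <- h2o.gbm(x = x, y = y, training_frame = train, validation_frame = valid)
perf <- h2o.performance(gbm_model, test)
h2o.confusionMatrix(perf)
h2o.logloss(perf)
```

Note that h2o.splitFrame() assigns whatever remains after the listed ratios to the final frame, so c(0.7, 0.15) yields a 70/15/15 split.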

Grid and random search with h2o:

  • Can use the help files to obtain all of the available hyperparameters
    • ?h2o.gbm
  • Brief recap of data preparation steps
    • seeds_data_hf <- as.h2o(seeds_data)
    • y <- "seed_type"
    • x <- setdiff(colnames(seeds_data_hf), y)
    • sframe <- h2o.splitFrame(data = seeds_data_hf, ratios = c(0.7, 0.15), seed = 42)
    • train <- sframe[[1]]
    • valid <- sframe[[2]]
    • test <- sframe[[3]]
  • Can then define a hyperparameter grid
    • gbm_params <- list(ntrees = c(100, 150, 200), max_depth = c(3, 5, 7), learn_rate = c(0.001, 0.01, 0.1))
    • gbm_grid <- h2o.grid("gbm", grid_id = "gbm_grid", x = x, y = y, training_frame = train, validation_frame = valid, seed = 42, hyper_params = gbm_params)
    • gbm_gridperf <- h2o.getGrid(grid_id = "gbm_grid", sort_by = "accuracy", decreasing = TRUE)
    • best_gbm <- h2o.getModel(gbm_gridperf@model_ids[[1]]) # Top GBM model chosen by validation accuracy has id position 1 due to the decreasing accuracy sort above
    • print(best_gbm@model[["model_summary"]])
    • h2o.performance(best_gbm, test)
  • Can also run random search using h2o
    • gbm_params <- list(ntrees = c(100, 150, 200), max_depth = c(3, 5, 7), learn_rate = c(0.001, 0.01, 0.1))
    • search_criteria <- list(strategy = "RandomDiscrete", max_runtime_secs = 60, seed = 42)
    • gbm_grid <- h2o.grid("gbm", grid_id = "gbm_grid", x = x, y = y, training_frame = train, validation_frame = valid, seed = 42, hyper_params = gbm_params, search_criteria = search_criteria)
    • search_criteria <- list(strategy = "RandomDiscrete", stopping_metric = "mean_per_class_error", stopping_tolerance = 0.0001, stopping_rounds = 6)
    • gbm_grid <- h2o.grid("gbm", x = x, y = y, training_frame = train, validation_frame = valid, seed = 42, hyper_params = gbm_params, search_criteria = search_criteria)

Automatic machine learning with h2o:

  • Can run automatic machine learning (AutoML) in h2o simply by specifying a total run time; it trains and cross-validates the following model types:
    • Generalized Linear Model (GLM)
    • (Distributed) Random Forest (DRF)
    • Extremely Randomized Trees (XRT)
    • Extreme Gradient Boosting (XGBoost)
    • Gradient Boosting Machines (GBM)
    • Deep Learning (fully-connected multi-layer artificial neural network)
    • Stacked Ensembles (of all models & of best of family)
  • Hyperparameters are tuned automatically as part of the AutoML run (for the GBM and deep learning models)
    • automl_model <- h2o.automl(x = x, y = y, training_frame = train, validation_frame = valid, max_runtime_secs = 60, sort_metric = "logloss", seed = 42)
  • Can view the resulting leaderboard of models with automl_model@leaderboard

Wrap up:

  • The course covered hyperparameters and the benefits of tuning them in machine learning models, including:
    • Cartesian Grid Search
    • Random Search
    • Adaptive Resampling
    • Automatic Machine Learning
    • Evaluating tuning results with performance metrics
    • Stopping criteria

Example code includes:

# Code runs OK in Console, does not run in knitr
library(mlr)

vecSeed <- c(15.26, 14.29, 13.84, 16.14, 14.38, 14.69, 15.26, 13.89, 13.78, 13.74, 14.59, 13.99, 15.69, 14.7, 12.72, 14.11, 15.01, 13.02, 14.11, 13.45, 13.16, 15.49, 14.09, 13.94, 15.05, 17.08, 14.8, 13.5, 13.16, 15.5, 13.8, 15.36, 14.99, 14.43, 15.78, 17.63, 16.84, 19.11, 16.82, 16.77, 20.71, 17.12, 18.72, 20.2, 19.57, 19.51, 18.88, 18.98, 20.88, 18.81, 18.59, 18.36, 16.87, 18.17, 18.72, 19.46, 19.18, 18.95, 18.83, 17.63, 18.55, 18.45, 19.38, 19.13, 19.14, 20.97, 19.06, 18.96, 19.15, 20.24, 13.07, 13.34, 12.22, 11.82, 11.21, 11.43, 12.49, 10.79, 11.83, 12.01, 12.26, 11.18, 11.19, 11.34, 11.75, 11.49, 12.54, 12.02, 12.05, 12.55, 11.14, 12.1, 12.15, 10.8, 11.26, 11.41, 12.46, 12.19, 11.65, 11.56, 11.81, 10.91, 11.23, 11.27, 11.87, 14.84, 14.09, 13.94, 14.99, 14.21, 14.49, 14.85, 14.02, 14.06, 14.05, 14.28, 13.83, 14.75, 14.21, 13.57, 14.26, 14.76, 13.76, 14.18, 14.02, 13.82, 14.94, 14.41, 14.17, 14.68, 15.38, 14.52, 13.85, 13.55, 14.86, 14.04, 14.76, 14.56, 14.4, 14.91, 15.98, 15.67, 16.26, 15.51, 15.62, 17.23, 15.55, 16.19, 16.89, 16.74, 16.71, 16.26, 16.66, 17.05, 16.29, 16.05, 16.52, 15.65, 16.26, 16.34, 16.5, 16.63, 16.42, 16.29, 15.86, 16.22, 16.12, 16.72, 16.31, 16.61, 17.25, 16.45, 16.2, 16.45, 16.91, 13.92, 13.95, 13.32, 13.4, 13.13, 13.13, 13.46, 12.93, 13.23, 13.52, 13.6, 13.04, 13.05, 12.87, 13.52, 13.22, 13.67, 13.33, 13.41, 13.57, 12.79, 13.15, 13.45, 12.57, 13.01, 12.95, 13.41, 13.36, 13.07, 13.31, 13.45, 12.8, 12.82, 12.86, 13.02, 0.87, 0.9, 0.9, 0.9, 0.9, 0.88, 0.87, 0.89, 0.88, 0.87, 0.9, 0.92, 0.91, 0.92, 0.87, 0.87, 0.87, 0.86, 0.88, 0.86, 0.87, 0.87, 0.85, 0.87, 0.88, 0.91, 0.88, 0.89, 0.9, 0.88, 0.88, 0.89, 0.89, 0.88, 0.89, 0.87, 0.86, 0.91, 0.88, 0.86, 0.88, 0.89, 0.9, 0.89, 0.88, 0.88, 0.9, 0.86, 0.9, 0.89, 0.91, 0.85, 0.86, 0.86, 0.88, 0.9, 0.87, 0.88, 0.89, 0.88, 0.89, 0.89, 0.87, 0.9, 0.87, 0.89, 0.89, 0.91, 0.89, 0.89, 0.85, 0.86, 0.87, 0.83, 0.82, 0.83, 0.87, 0.81, 0.85, 0.82, 0.83, 0.83, 0.83, 0.86, 0.81, 0.83, 0.84, 0.85, 0.84, 0.86, 
0.86, 0.88, 0.84, 0.86, 0.84, 0.86, 0.87, 0.86, 0.86, 0.82, 0.82, 0.84, 0.86, 0.86, 0.88, 5.76, 5.29, 5.32, 5.66, 5.39, 5.56, 5.71, 5.44, 5.48, 5.48, 5.35, 5.12, 5.53, 5.21, 5.23, 5.52, 5.79, 5.39, 5.54, 5.52, 5.45, 5.76, 5.72, 5.58, 5.71, 5.83, 5.66, 5.35, 5.14, 5.88, 5.38, 5.7, 5.57, 5.58, 5.67, 6.19, 6, 6.15, 6.02, 5.93, 6.58, 5.85, 6.01, 6.29, 6.38, 6.37, 6.08, 6.55, 6.45, 6.27, 6.04, 6.67, 6.14, 6.27, 6.22, 6.11, 6.37, 6.25, 6.04, 6.03, 6.15, 6.11, 6.3, 6.18, 6.26, 6.56, 6.42, 6.05, 6.25, 6.32, 5.47, 5.39, 5.22, 5.31, 5.28, 5.18, 5.27, 5.32, 5.26, 5.41, 5.41, 5.22, 5.25, 5.05, 5.44, 5.3, 5.45, 5.35, 5.27, 5.33, 5.01, 5.11, 5.42, 4.98, 5.19, 5.09, 5.24, 5.24, 5.11, 5.36, 5.41, 5.09, 5.09, 5.09, 5.13, 3.31, 3.34, 3.38, 3.56, 3.31, 3.26, 3.24, 3.2, 3.16, 3.11, 3.33, 3.38, 3.51, 3.47, 3.05, 3.17, 3.25, 3.03, 3.22, 3.06, 2.98, 3.37, 3.19, 3.15, 3.33)
vecSeed <- c(vecSeed, 3.68, 3.29, 3.16, 3.2, 3.4, 3.15, 3.39, 3.38, 3.27, 3.43, 3.56, 3.48, 3.93, 3.49, 3.44, 3.81, 3.57, 3.86, 3.86, 3.77, 3.8, 3.76, 3.67, 4.03, 3.69, 3.86, 3.48, 3.46, 3.51, 3.68, 3.89, 3.68, 3.75, 3.79, 3.57, 3.67, 3.77, 3.79, 3.9, 3.74, 3.99, 3.72, 3.9, 3.82, 3.96, 2.99, 3.07, 2.97, 2.78, 2.69, 2.72, 2.97, 2.65, 2.84, 2.78, 2.83, 2.69, 2.67, 2.85, 2.68, 2.69, 2.88, 2.81, 2.85, 2.97, 2.79, 2.94, 2.84, 2.82, 2.71, 2.77, 3.02, 2.91, 2.85, 2.68, 2.72, 2.67, 2.82, 2.8, 2.95, 2.22, 2.7, 2.26, 1.36, 2.46, 3.59, 4.54, 3.99, 3.14, 2.93, 4.18, 5.23, 1.6, 1.77, 4.1, 2.69, 1.79, 3.37, 2.75, 3.53, 0.86, 3.41, 3.92, 2.12, 2.13, 2.96, 3.11, 2.25, 2.46, 4.71, 1.56, 1.37, 2.96, 3.98, 5.59, 4.08, 4.67, 2.94, 4, 4.92, 4.45, 2.86, 5.32, 5.17, 1.47, 2.96, 1.65, 3.69, 5.02, 3.24, 6, 4.93, 3.7, 2.85, 2.19, 4.31, 3.36, 3.37, 2.55, 3.75, 1.74, 2.23, 3.68, 2.11, 6.68, 4.68, 2.25, 4.33, 3.08, 5.9, 5.3, 6, 5.47, 4.47, 6.17, 2.22, 4.42, 5.46, 5.2, 6.99, 4.76, 3.33, 5.81, 3.35, 4.38, 5.39, 3.08, 4.27, 4.99, 4.42, 6.39, 2.2, 3.64, 4.77, 5.34, 4.96, 4.99, 4.86, 5.21, 4.06, 4.9, 4.18, 7.52, 3.98, 3.6, 5.22, 4.83, 4.8, 5.17, 4.96, 5.22, 5.31, 4.74, 4.87, 4.83, 4.78, 4.78, 5.05, 4.65, 4.91, 5.22, 5, 4.83, 5.04, 5.1, 5.06, 5.23, 5.3, 5.01, 5.36, 5.48, 5.31, 5.18, 4.78, 5.53, 4.96, 5.13, 5.17, 5.14, 5.14, 6.06, 5.88, 6.08, 5.84, 5.8, 6.45, 5.75, 5.88, 6.19, 6.27, 6.18, 6.11, 6.5, 6.32, 6.05, 5.88, 6.45, 5.97, 6.27, 6.1, 6.01, 6.23, 6.15, 5.88, 5.93, 5.89, 5.79, 5.96, 5.92, 6.05, 6.32, 6.16, 5.75, 6.18, 6.19, 5.39, 5.31, 5.22, 5.18, 5.28, 5.13, 5, 5.19, 5.31, 5.27, 5.36, 5, 5.22, 5, 5.31, 5.31, 5.49, 5.31, 5.05, 5.18, 5.05, 5.06, 5.34, 5.06, 5.09, 4.83, 5.15, 5.16, 5.13, 5.18, 5.35, 4.96, 4.96, 5, 5.13, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
seeds_train_data <- as.data.frame(matrix(vecSeed, nrow=105, byrow=FALSE))
names(seeds_train_data) <- c('area', 'perimeter', 'compactness', 'kernel_length', 'kernel_width', 'asymmetry', 'kernel_groove', 'seed_type')
seeds_train_data$seed_type <- as.factor(seeds_train_data$seed_type)
glimpse(seeds_train_data)


# Initialise h2o cluster
h2o::h2o.init()

# Convert data to h2o frame
seeds_train_data_hf <- h2o::as.h2o(seeds_train_data)

# Identify target and features
y <- "seed_type"
x <- setdiff(colnames(seeds_train_data_hf), y)

# Split data into train & validation sets
sframe <- h2o::h2o.splitFrame(seeds_train_data_hf, seed = 42)
train <- sframe[[1]]
valid <- sframe[[2]]

# Calculate ratio of the target variable in the training set
summary(seeds_train_data_hf$seed_type, exact_quantiles = TRUE)


# Train random forest model
rf_model <- h2o::h2o.randomForest(x = x, y = y, training_frame = train, validation_frame = valid)

# Calculate model performance
perf <- h2o::h2o.performance(rf_model, valid = TRUE)

# Extract confusion matrix
h2o::h2o.confusionMatrix(perf)

# Extract logloss
h2o::h2o.logloss(perf)


# Define hyperparameters
dl_params <- list(hidden = list(c(50, 50), c(100, 100)), epochs = c(5, 10, 15), 
                  rate = c(0.001, 0.005, 0.01)
                  )


# Define search criteria
search_criteria <- list(strategy = "RandomDiscrete",
                        max_runtime_secs = 10, # this is way too short & only used to keep runtime short!
                        seed = 42
                        )

# Train with random search
dl_grid <- h2o::h2o.grid("deeplearning", grid_id = "dl_grid", x = x, y = y,
                         training_frame = train, validation_frame = valid, seed = 42,
                         hyper_params = dl_params, search_criteria = search_criteria
                         )


# Define early stopping
stopping_params <- list(strategy = "RandomDiscrete", stopping_metric = "misclassification",
                        stopping_rounds = 2, stopping_tolerance = 0.1, seed = 42
                        )


# Run automatic machine learning
automl_model <- h2o::h2o.automl(x = x, y = y, training_frame = train, max_runtime_secs = 10,
                                sort_metric = "mean_per_class_error", leaderboard_frame = valid, seed = 42
                                )


# Extract the leaderboard
lb <- automl_model@leaderboard
head(lb)

# Assign best model new object name
aml_leader <- automl_model@leader

# Look at best model
summary(aml_leader)

Intermediate Functional Programming with purrr

Chapter 1 - Programming with purrr

Refresher of purrr Basics:

  • The map() function is one of the most basic purrr calls
    • map(.x, .f, ...) # for each element of .x, apply .f
  • OpenData files are available from the French city of Saint-Malo
    • JSON format; nested list
  • The map() function itself always returns a list
    • res <- map(visit_2015, sum) # returns a list
  • Typed variants return other output types, such as map_dbl() for a numeric (double) vector
    • res <- map_dbl(visit_2015, sum) # returns a numeric vector
  • Can also extend to map2(.x, .y, .f, ...), which resolves to .f(.x, .y, ...)
    • res <- map2(visit_2015, visit_2016, sum)
    • res <- map2_dbl(visit_2015, visit_2016, sum)
  • Can use pmap() to run operations on 3+ items, though these need to be passed in as a list
    • l <- list(visit_2014, visit_2015, visit_2016)
    • res <- pmap(l, sum)
    • res <- pmap_dbl(l, sum)

Introduction to mappers:

  • The .f argument is the action element: a function applied to every element, a number n to extract the nth element, or a character vector of names to extract named elements
  • The functions can either be regular functions or lambda (anonymous) functions
    • map_dbl(visit_2014, function(x) { round(mean(x)) })
  • The anonymous function with a one-sided formula can be written in any of several ways
    • map_dbl(visits2017, ~ round(mean(.x))) # typically the default
    • map_dbl(visits2017, ~ round(mean(.)))
    • map_dbl(visits2017, ~ round(mean(..1)))
    • map2(visits2016, visits2017, ~ .x + .y)
    • map2(visits2016, visits2017, ~ ..1 + ..2)
  • Can extend to data with more than 2 parameters
    • pmap(list, ~ ..1 + ..2 + ..3)
  • Can use as_mapper to create mapper objects from lambda functions
    • round_mean <- function(x){ round(mean(x)) }
    • round_mean <- as_mapper(~ round(mean(.x)))
  • Mappers have several benefits
    • More concise
    • Easier to read than functions
    • Reusable

Using Mappers to Clean Data:

  • Can use set_names from purrr to set the names of a list
    • visits2016 <- set_names(visits2016, month.abb)
    • all_visits <- list(visits2015, visits2016, visits2017)
    • named_all_visits <- map(all_visits, ~ set_names(.x, month.abb))
  • The keep() function extracts elements that satisfy a condition
    • over_30000 <- keep(visits2016, ~ sum(.x) > 30000)
    • limit <- as_mapper(~ sum(.x) > 30000)
    • over_mapper <- keep(visits2016, limit)
  • The discard() function removes elements that satisfy a condition
    • under_30000 <- discard(visits2016, ~ sum(.x) > 30000)
    • limit <- as_mapper(~ sum(.x) > 30000)
    • under_mapper <- discard(visits2016, limit)
    • names(under_mapper)
  • Can use keep() and discard() with map() to clean up lists
    • df_list <- list(iris, airquality) %>% map(head)
    • map(df_list, ~ keep(.x, is.factor))

Predicates:

  • Predicates return either TRUE or FALSE - is.numeric() is an example
  • Predicate functionals take an element and a predicate, and then use the predicate on the element
    • keep(airquality, is.numeric) # keep all elements that return TRUE when run against the predicate
  • There are also extensions of every() and some()
    • every(visits2016, is.numeric)
    • every(visits2016, ~ mean(.x) > 1000)
    • some(visits2016, ~ mean(.x) > 1000)
  • The detect_index() function returns the index of the first (or, with .right = TRUE, the last) element that satisfies a condition
    • detect_index(visits2016, ~ mean(.x) > 1000) # index of first element that satisfies
    • detect_index(visits2016, ~ mean(.x) > 1000, .right = TRUE) # index of last element that satisfies
  • The detect() function returns the value rather than the index
    • detect(visits2016, ~ mean(.x) > 1000, .right = TRUE)
  • The has_element() function tests whether an object contains a given value
    • visits2016_mean <- map(visits2016, mean)
    • has_element(visits2016_mean, 981)
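
The predicate helpers above can be sketched on a small named list; monthly is a made-up stand-in for the course's visits2016 data, not from the original exercises.

```r
library(purrr)

# Hypothetical monthly visit counts (stand-in for the course data)
monthly <- list(jan = c(900, 1100), feb = c(1200, 1400), mar = c(300, 400))

every(monthly, is.numeric)                              # TRUE - all elements are numeric
some(monthly, ~ mean(.x) > 1000)                        # TRUE - feb qualifies
detect_index(monthly, ~ mean(.x) > 900)                 # 1 - jan, the first match
detect_index(monthly, ~ mean(.x) > 900, .right = TRUE)  # 2 - feb, the last match
detect(monthly, ~ mean(.x) > 1000)                      # c(1200, 1400) - feb's values
has_element(map(monthly, mean), 350)                    # TRUE - mar's mean is 350
```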

Example code includes:

# Create the to_day function
to_day <- function(x) {
 x*24
}

visit_a <- c(117, 147, 131, 73, 81, 134, 121)
visit_b <- c(180, 193, 116, 166, 131, 153, 146)
visit_c <- c(57, 110, 68, 72, 87, 141, 67)

# Create a list containing both vectors: all_visits
all_visits <- list(visit_a, visit_b)

# Convert to daily number of visits: all_visits_day
all_visits_day <- map(all_visits, to_day)

# Map the mean() function and output a numeric vector 
map_dbl(all_visits_day, mean)
## [1] 2756.571 3720.000
# You'll test out both map() and walk() for plotting
# Both trigger the side effects (drawing plots, downloading a file, changing the working directory...), but walk() returns its input invisibly and won't print anything to the console.

# Create all_tests list  and modify with to_day() function
all_tests <- list(visit_a, visit_b, visit_c)
all_tests_day <- map(all_tests, to_day)

# Plot all_tests_day with map
map(all_tests_day, barplot)

## [[1]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
## 
## [[2]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
## 
## [[3]]
##      [,1]
## [1,]  0.7
## [2,]  1.9
## [3,]  3.1
## [4,]  4.3
## [5,]  5.5
## [6,]  6.7
## [7,]  7.9
# Plot all_tests_day
walk(all_tests_day, barplot)

# Get sum of all visits and class of sum_all
sum_all <- pmap(all_tests_day, sum)
class(sum_all)
## [1] "list"
# Turn visit_a into daily number using an anonymous function
map(visit_a, function(x) {
  x*24
})
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
# Turn visit_a into daily number of visits by using a mapper
map(visit_a, ~.x*24)
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
# Create a mapper object called to_day
to_day <- as_mapper(~.x*24)

# Use it on the three vectors
map(visit_a, to_day)
## [[1]]
## [1] 2808
## 
## [[2]]
## [1] 3528
## 
## [[3]]
## [1] 3144
## 
## [[4]]
## [1] 1752
## 
## [[5]]
## [1] 1944
## 
## [[6]]
## [1] 3216
## 
## [[7]]
## [1] 2904
map(visit_b, to_day)
## [[1]]
## [1] 4320
## 
## [[2]]
## [1] 4632
## 
## [[3]]
## [1] 2784
## 
## [[4]]
## [1] 3984
## 
## [[5]]
## [1] 3144
## 
## [[6]]
## [1] 3672
## 
## [[7]]
## [1] 3504
map(visit_c, to_day)
## [[1]]
## [1] 1368
## 
## [[2]]
## [1] 2640
## 
## [[3]]
## [1] 1632
## 
## [[4]]
## [1] 1728
## 
## [[5]]
## [1] 2088
## 
## [[6]]
## [1] 3384
## 
## [[7]]
## [1] 1608
# Round visit_a to the nearest tenth with a mapper
map_dbl(visit_a, ~ round(.x, -1))
## [1] 120 150 130  70  80 130 120
# Create to_ten, a mapper that rounds to the nearest tenth
to_ten <- as_mapper(~ round(.x, -1))

# Map to_ten on visit_b
map_dbl(visit_b, to_ten)
## [1] 180 190 120 170 130 150 150
# Map to_ten on visit_c
map_dbl(visit_c, to_ten)
## [1]  60 110  70  70  90 140  70
# Create a mapper that tests if .x is more than 100
is_more_than_hundred <- as_mapper(~ .x > 100)

# Run this mapper on the all_visits object
all_visits <- list(visit_a, visit_b, visit_c)
map(all_visits, ~ keep(.x, is_more_than_hundred) )
## [[1]]
## [1] 117 147 131 134 121
## 
## [[2]]
## [1] 180 193 116 166 131 153 146
## 
## [[3]]
## [1] 110 141
# Use the day vector to set names on all_visits
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
full_visits_named <- map(all_visits, ~ set_names(.x, day))

# Use this mapper with keep() 
map(full_visits_named, ~ keep(.x, is_more_than_hundred))
## [[1]]
## mon tue wed sat sun 
## 117 147 131 134 121 
## 
## [[2]]
## mon tue wed thu fri sat sun 
## 180 193 116 166 131 153 146 
## 
## [[3]]
## tue sat 
## 110 141
# Set the name of each subvector
day <- c("mon", "tue", "wed", "thu", "fri", "sat", "sun")
all_visits_named <- map(all_visits, ~ set_names(.x, day))

# Create a mapper that will test if .x is over 100 
threshold <- as_mapper(~.x > 100)

# Run this mapper on the all_visits object: group_over
group_over <- map(all_visits, ~ keep(.x, threshold) )

# Run this mapper on the all_visits object: group_under
group_under <-  map(all_visits, ~ discard(.x, threshold) )


# Create a threshold variable, set it to 160
threshold <- 160

# Create a mapper that tests if .x is over the defined threshold
over_threshold <- as_mapper(~ .x > threshold)

# Are all elements in each all_visits vector over the defined threshold?
map(all_visits, ~ every(.x, over_threshold))
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] FALSE
## 
## [[3]]
## [1] FALSE
# Are some elements in each all_visits vector over the defined threshold?
map(all_visits, ~ some(.x, over_threshold))
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] FALSE

Chapter 2 - Functional Programming from Theory to Practice

Functional Programming in R:

  • Everything that exists is an object and everything that happens is a function call
    • This means that a function is an object and can be treated as such
    • Every action in R is performed by a function
    • Functions are first-class citizens, and behave like any other object
    • Functions can be manipulated, stored as variables, lambda (anonymous), stored in a list, arguments of a function, returned by a function
    • R is a functional programming language
  • In a “pure function”, output depends only on input, and there are no side-effects (no changes to the environment)
    • Sys.Date() depends on the environment and is thus not pure
    • write.csv() is called solely for the side effect (writing a file) and is thus not pure
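
The pure/impure distinction can be sketched with two toy functions (both names are ours, not from the course):

```r
# Pure - the output depends only on the input, and nothing else changes
double_it <- function(x) x * 2

# Impure - the output depends on (and modifies) state outside the function
counter <- 0
bump <- function() {
  counter <<- counter + 1  # side effect: writes to the enclosing environment
  counter
}

double_it(21)  # always 42
bump()         # 1 on the first call, 2 on the second...
bump()
```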

Tools for Functional Programming in purrr:

  • A higher-order function can take functions as input and return functions as output
    • nop_na <- function(fun) {
    •   function(...) { fun(..., na.rm = TRUE) }
    • }
    • sd_no_na <- nop_na(sd)
    • sd_no_na( c(NA, 1, 2, NA) )
  • There are three types of higher-order functions
    • Functionals take another function and return a vector - like map()
    • Function factories take a vector and create a function
    • Function operators take functions and return (modified) functions - considered to be "adverbs"
  • Two of the most common adverbs in purrr are safely() and possibly()
    • The safely() call returns a function that will return $result and $error when run; helpful for diagnosing issues with code rather than losing the information
    • safe_log <- safely(log)
    • safe_log("a") # $result will be NULL and $error will contain the error
    • map( list(2, "a"), safely(log) )

Using possibly():

  • The possibly() function is an adverb that returns either the value of the function OR the value specified in the otherwise element
    • possible_sum <- possibly(sum, otherwise = "nop")
    • possible_sum("a") # result will be "nop"
  • Note that unlike safely(), possibly() does not preserve the error; on failure it just returns the pre-specified otherwise value

Handling adverb results:

  • Can use transpose() to turn the output inside out
    • transpose() turns a list of n elements, each containing a and b, into a list with elements a and b, each containing n entries
  • The compact() function will remove the NULL elements
    • l <- list(1, 2, 3, "a")
    • possible_log <- possibly(log, otherwise = NULL)
    • map(l, possible_log) %>% compact()
  • Can use the httr package specifically for http requests
    • httr::GET(url) will return the value from attempting to connect to url - 200 is good, 404 is unavailable, etc.
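
A minimal sketch of these two cleanup steps together, using safely(log) on a made-up mixed list:

```r
library(purrr)

raw <- list(2, "a", 10)
res <- map(raw, safely(log))

# transpose() flips 3 elements of $result/$error into one $result and one $error
flipped <- purrr::transpose(res)

# compact() then drops the NULL left behind by the failed call
compact(flipped$result)  # list(log(2), log(10))
```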

Example code includes:

# `$` is a function call of a special type called an 'infix operator': infix operators are placed between two elements and can be used without parentheses.

# Launch Sys.time(), Sys.sleep(1), & Sys.time()
Sys.time()
## [1] "2019-06-10 08:40:53 CDT"
Sys.sleep(1)
Sys.time()
## [1] "2019-06-10 08:40:54 CDT"
# Launch nrow(iris), Sys.sleep(1), & nrow(iris)
data(iris)
nrow(iris)
## [1] 150
Sys.sleep(1)
nrow(iris)
## [1] 150
# Launch ls(), create an object, then rerun the ls() function
# ls()
# this <- 12
# ls()

# Create a plot of the iris dataset
plot(iris)

urls <- c('https://thinkr.fr', 'https://colinfay.me', 'http://not_working.org', 'https://datacamp.com', 'http://cran.r-project.org/', 'https://not_working_either.org')
# Create a safe version of read_lines()
safe_read <- safely(read_lines)

# Map it on the urls vector
res <- map(urls, safe_read)

# Set the name of the results to `urls`
named_res <-  set_names(res, urls)

# Extract only the "error" part of each sublist
map(named_res, "error")
## $`https://thinkr.fr`
## NULL
## 
## $`https://colinfay.me`
## NULL
## 
## $`http://not_working.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working.org>
## 
## $`https://datacamp.com`
## NULL
## 
## $`http://cran.r-project.org/`
## NULL
## 
## $`https://not_working_either.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working_either.org>
# Code a function that discard() the NULL from safe_read()
safe_read_discard <- function(url){
    safe_read(url) %>%
        discard(is.null)
}

# Map this function on the url list
res <- map(urls, safe_read_discard)


# Create a possibly() version of read_lines()
possible_read <- possibly(read_lines, otherwise = 404)

# Map this function on urls, pipe it into set_names()
res <- map(urls, possible_read) %>% set_names(urls)

# Paste each element of the list 
res_pasted <- map(res, paste, collapse=" ")

# Keep only the elements which are equal to 404
keep(res_pasted, ~ .x == 404)
## $`http://not_working.org`
## [1] "404"
## 
## $`https://not_working_either.org`
## [1] "404"
url_tester <- function(url_list){
    url_list %>%
        # Map a version of read_lines() that otherwise returns 404
        map( possibly(read_lines, otherwise = 404) ) %>%
        # Set the names of the result
        set_names( urls ) %>% 
        # paste() and collapse each element
        map(paste, collapse =" ") %>%
        # Remove the 404 
        discard(~.x==404) %>%
        names() # Will return the names of the good ones
}

# Try this function on the urls object
url_tester(urls)
## [1] "https://thinkr.fr"          "https://colinfay.me"       
## [3] "https://datacamp.com"       "http://cran.r-project.org/"
url_tester <- function(url_list, type = c("result", "error")){
    type <- match.arg(type)  # default to "result" and validate the argument
    res <- url_list %>%
        # Create a safely() version of read_lines() 
        map( safely(read_lines) ) %>%
        set_names( url_list ) %>%
        # Transpose into a list of $result and $error
        purrr::transpose() 
    # Complete this if statement
    if (type == "result") return( res$result ) 
    if (type == "error") return( res$error ) 
}

# Try this function on the urls object
url_tester(urls, type = "error") 
## $`https://thinkr.fr`
## NULL
## 
## $`https://colinfay.me`
## NULL
## 
## $`http://not_working.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working.org>
## 
## $`https://datacamp.com`
## NULL
## 
## $`http://cran.r-project.org/`
## NULL
## 
## $`https://not_working_either.org`
## <simpleError in open.connection(con, "rb"): Could not resolve host: not_working_either.org>
url_tester <- function(url_list){
    url_list %>%
        # Map a version of GET() that would otherwise return NULL 
        map( possibly(httr::GET, otherwise=NULL) ) %>%
        # Set the names of the result
        set_names( urls ) %>%
        # Remove the NULL
        compact() %>%
        # Extract all the "status_code" elements
        map("status_code")
}

# Try this function on the urls object
url_tester(urls)
## $`https://thinkr.fr`
## [1] 200
## 
## $`https://colinfay.me`
## [1] 200
## 
## $`https://datacamp.com`
## [1] 200
## 
## $`http://cran.r-project.org/`
## [1] 200

Chapter 3 - Better Code with purrr

Rationale for cleaner code:

  • Cleaner code is easier to debug (spot typos), easier to interpret, and easier to modify
    • tidy_iris_lm <- compose( as_mapper(~ filter(.x, p.value < 0.05)), tidy, partial(lm, data=iris, na.action = na.fail) )
    • list( Petal.Length ~ Petal.Width, Petal.Width ~ Sepal.Width, Sepal.Width ~ Sepal.Length ) %>% map(tidy_iris_lm)
  • Clean code characteristics
    • Light - no unnecessary code
    • Readable - less repetition makes for easier reading (one piece of code for one task)
    • Interpretable
    • Maintainable
  • The compose() function is used to compose a function from two or more functions
    • rounded_mean <- compose(round, mean)

Building functions with compose() and negate():

  • There is no limit to the number of functions that can be chained with compose()
    • clean_aov <- compose(tidy, anova, lm)
  • Can use negate() to flip the predicate - TRUE becomes FALSE and FALSE becomes TRUE
    • is_not_na <- negate(is.na)
    • under_hundred <- as_mapper(~ mean(.x) < 100)
    • not_under_hundred <- negate(under_hundred)
    • map_lgl(98:102, under_hundred)
    • map_lgl(98:102, not_under_hundred)
  • The "good" status return codes from GET() are in the low 200s
    • good_status <- c(200, 201, 202, 203)
    • status %in% good_status

Prefilling functions:

  • The partial() allows for pre-filling a function
    • mean_na_rm <- partial(mean, na.rm = TRUE)
    • lm_iris <- partial(lm, data = iris)
  • Can also combine partial() and compose()
    • rounded_mean <- compose( partial(round, digits = 2), partial(mean, na.rm = TRUE) )
  • Can use functions from rvest for web scraping
    • read_html()
    • html_nodes()
    • html_text()
    • html_attr()
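
Prefilling and composition can be combined without the web-scraping piece; this sketch only assumes base R's airquality data set (the function names are ours):

```r
library(purrr)

# Prefill mean() and round(), then chain them into a single function
mean_na_rm   <- partial(mean, na.rm = TRUE)
round_two    <- partial(round, digits = 2)
rounded_mean <- compose(round_two, mean_na_rm)

# One rounded, NA-safe mean per column
map_dbl(airquality, rounded_mean)
```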

List columns:

  • A list column is part of a nested data frame - one or more of the data frame columns is itself a list (requires use of tibble rather than data.frame)
    • df <- tibble( classic = c("a", "b", "c"), list = list( c("a", "b", "c"), c("a", "b", "c", "d"), c("a", "b", "c", "d", "e") ) )
    • a_node <- partial(html_nodes, css = "a")
    • href <- partial(html_attr, name = "href")
    • get_links <- compose( href, a_node, read_html )
    • urls_df <- tibble( urls = c("https://thinkr.fr", "https://colinfay.me", "https://datacamp.com", "http://cran.r-project.org/") )
    • urls_df %>% mutate(links = map(urls, get_links))
  • Can also unnest the data from the list columns
    • urls_df %>% mutate(links = map(urls, get_links)) %>% unnest()
  • Can also nest() a standard data.frame
    • iris_n <- iris %>% group_by(Species) %>% tidyr::nest()
  • Since the list column is a list, the purrr functions can be run on them
    • iris_n %>% mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
    • summary_lm <- compose(summary, lm)
    • iris %>% group_by(Species) %>% nest() %>% mutate(data = map(data, ~ summary_lm(Sepal.Length ~ Sepal.Width, data = .x)), data = map(data, "r.squared")) %>% unnest()
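
The nesting workflow above can be sketched end to end on iris; get_r2 is a helper name of ours, not from the course:

```r
library(dplyr)
library(tidyr)
library(purrr)

# R-squared of Sepal.Length ~ Sepal.Width for one data frame
get_r2 <- function(df) summary(lm(Sepal.Length ~ Sepal.Width, data = df))$r.squared

r2_by_species <- iris %>%
  group_by(Species) %>%
  nest() %>%                                  # one row per species, data as a list column
  mutate(r_squared = map_dbl(data, get_r2))

r2_by_species
```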

Example code includes:

urls <- c('https://thinkr.fr', 'https://colinfay.me', 'https://datacamp.com', 'http://cran.r-project.org/')

    
# Compose a status extractor 
status_extract <- purrr::compose(httr::status_code, httr::GET)

# Try with "https://thinkr.fr" & "http://datacamp.com"
status_extract("https://thinkr.fr")
## [1] 200
status_extract("http://datacamp.com")
## [1] 200
# Map it on the urls vector, return a vector of numbers
map_dbl(urls, status_extract)
## [1] 200 200 200 200
# Negate the %in% function 
`%not_in%` <- negate(`%in%`)

# Complete the function
strict_code <- function(url){
    code <- status_extract(url)
    if (code %not_in% c(200:203)){
        return(NA)
    } else {
        return(code)
    } 
}


# Map the strict_code function on the urls vector
res <- map(urls, strict_code)

# Set the names of the results using the urls vector
res_named <- set_names(res, urls)

# Negate the is.na function
is_not_na <- negate(is.na)

# Run is_not_na on the results
is_not_na(res_named)
##          https://thinkr.fr        https://colinfay.me 
##                       TRUE                       TRUE 
##       https://datacamp.com http://cran.r-project.org/ 
##                       TRUE                       TRUE
# Prefill html_nodes() with the css param set to h2
get_h2 <- partial(rvest::html_nodes, css="h2")

# Combine the html_text, get_h2 and read_html functions
get_content <- purrr::compose(rvest::html_text, get_h2, xml2::read_html)

# Map get_content to the urls list
res <- map(urls, get_content) %>%
    set_names(urls)

# Print the results to the console
res
## $`https://thinkr.fr`
##  [1] "Conseil, développement et formation au logiciel R"                                          
##  [2] "Formez-vous au logiciel R !"                                                                
##  [3] "\r\n\t\tRstudio & ThinkR roadshow – le 6 juin à Paris\r\n\t"                                   
##  [4] "\r\n\t\tBilan pédagogique et financier 2018\r\n\t"                                             
##  [5] "\r\n\t\tPédagogie de la formation au langage R\r\n\t"                                          
##  [6] "\r\n\t\tRetour sur les projets R des étudiants du MSc X-HEC Data Science for Business\r\n\t"   
##  [7] "\r\n\t\tConstruisons la certification R du RConsortium\r\n\t"                                  
##  [8] "\r\n\t\tLes tests statistiques\r\n\t"                                                          
##  [9] "\r\n\t\tÀ la découverte de RStudio Package Manager\r\n\t"                                      
## [10] "\r\n\t\tLes pièges de la représentation de données\r\n\t"                                      
## [11] "\r\n\t\tDBI : Distributeur des Brasseurs Indépendants ? Non DataBase Interface\r\n\t"          
## 
## $`https://colinfay.me`
## character(0)
## 
## $`https://datacamp.com`
## character(0)
## 
## $`http://cran.r-project.org/`
## character(0)
# Create a partial version of html_nodes(), with the css param set to "a"
a_node <- partial(rvest::html_nodes, css="a")

# Create href(), a partial version of html_attr()
href <- partial(rvest::html_attr, name = "href")

# Combine href(), a_node(), and read_html()
get_links <- purrr::compose(href, a_node, xml2::read_html)

# Map get_links() to the urls list
res <- map(urls, get_links) %>%
    set_names(urls)


# Create a "links" columns, by mapping get_links() on urls
df2 <- tibble::tibble(urls=urls) %>%
    mutate(links = map(urls, get_links)) 

# Print df2 to see what it looks like
df2
## # A tibble: 4 x 2
##   urls                       links      
##   <chr>                      <list>     
## 1 https://thinkr.fr          <chr [145]>
## 2 https://colinfay.me        <chr [47]> 
## 3 https://datacamp.com       <chr [50]> 
## 4 http://cran.r-project.org/ <chr [1]>
# unnest() df2 to have a tidy dataframe
df2 %>%
    unnest()
## # A tibble: 243 x 2
##    urls            links                                                   
##    <chr>           <chr>                                                   
##  1 https://thinkr~ https://thinkr.fr/                                      
##  2 https://thinkr~ https://thinkr.fr/                                      
##  3 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/              
##  4 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/introduction-~
##  5 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/statistique-a~
##  6 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/programmation~
##  7 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/r-et-le-big-d~
##  8 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/r-pour-la-fin~
##  9 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/integration-d~
## 10 https://thinkr~ https://thinkr.fr/formation-au-logiciel-r/formation-sig~
## # ... with 233 more rows

Chapter 4 - Case Study

Discovering the Dataset:

  • The dataset is available from https://github.com/ThinkR-open/datasets
    • rstudioconf: a list of 5055 tweets
    • length(rstudioconf)
    • length(rstudioconf[[1]])
    • purrr::vec_depth(rstudioconf)
  • JSON is a standard data format for the web, and typically consists of key-value pairs which are read as nested lists by R
  • Refresher of keep() and discard() usage
    • keep(1:10, ~ .x < 5)
    • discard(1:10, ~ .x < 5)

Extracting Information from the Dataset:

  • Can manipulate functions for list cleaning using high-order functions - includes partial() and compose()
    • sum_no_na <- partial(sum, na.rm = TRUE)
    • map_dbl(airquality, sum_no_na)
    • rounded_sum <- compose(round, sum_no_na)
    • map_dbl(airquality, rounded_sum)
  • Can also clean lists using compact() to remove NULL and flatten() to remove one level from a nested list
    • l <- list(NULL, 1, 2, 3, NULL)
    • compact(l)
    • my_list <- list( list(a = 1), list(b = 2) )
    • flatten(my_list)

Manipulating URL:

  • Can use the mapper functions to create a re-usable function
    • mult <- as_mapper(~ .x * 2)
  • Can use str_detect inside the mapper function
    • lyrics <- c("Is this the real life?", "Is this just fantasy?", "Caught in a landslide", "No escape from reality")
    • stringr::str_detect(lyrics, "life")

Identifying Influencers:

  • Can use the map_at() function to run a function at a specific portion of the list
    • my_list <- list( a = 1:10, b = 1:100, c = 12 )
    • map_at(.x = my_list, .at = “b”, .f = sum)
  • Can also use negate() to reverse the action of a predicate
    • not_character <- negate(is.character)
    • my_list <- list( a = 1:10, b = “a”, c = iris )
    • map(my_list, not_character)

Wrap up:

  • Lambda functions and reusable mappers
    • map(1:5, ~ .x*10)
    • ten_times <- as_mapper(~ .x * 10)
    • map(1:5, ten_times)
  • Function manipulation using functionals (functions that take functions as inputs and return vectors)
    • map() & friends
    • keep() & discard()
    • some() & every()
  • Function operators take functions and return (modified) functions
    • safely() & possibly()
    • partial()
    • compose()
    • negate()
  • Cleaner code is easier to read, understand, and maintain
    • rounded_mean <- compose( partial(round, digits = 1), partial(mean, trim = 2, na.rm = TRUE) )
    • map( list(airquality, mtcars), ~ map_dbl(.x, rounded_mean) )

Example code includes:

rstudioconfDF <- readRDS("./RInputFiles/#RStudioConf.RDS")
rstudioconfListA <- split(rstudioconfDF, seq(nrow(rstudioconfDF)))
rstudioconf <- lapply(rstudioconfListA, FUN=as.list)


# Print the first element of the list to the console 
rstudioconf[[1]]
## $status_id
## [1] "960732355773239296"
## 
## $created_at
## [1] "2018-02-06 04:30:17 UTC"
## 
## $user_id
## [1] "626266741"
## 
## $screen_name
## [1] "grod_rf"
## 
## $text
## [1] "RT @dataandme: <U+0001F41C> Check it, @ajmcoqui's \"Debugging in RStudio\" \n<U+0001F4FD> Slides *and* cheat sheet!\nhttps://t.co/rAvKP9iXLa #rstats #rstudioconf htt…"
## 
## $source
## [1] "Twitter for Android"
## 
## $reply_to_status_id
## [1] NA
## 
## $reply_to_user_id
## [1] NA
## 
## $reply_to_screen_name
## [1] NA
## 
## $is_quote
## [1] FALSE
## 
## $is_retweet
## [1] TRUE
## 
## $favorite_count
## [1] 0
## 
## $retweet_count
## [1] 7
## 
## $hashtags
## $hashtags[[1]]
## [1] "rstats"      "rstudioconf"
## 
## 
## $symbols
## $symbols[[1]]
## [1] NA
## 
## 
## $urls_url
## $urls_url[[1]]
## [1] "buff.ly/2s7W8ED"
## 
## 
## $urls_t.co
## $urls_t.co[[1]]
## [1] "https://t.co/rAvKP9iXLa"
## 
## 
## $urls_expanded_url
## $urls_expanded_url[[1]]
## [1] "https://buff.ly/2s7W8ED"
## 
## 
## $media_url
## $media_url[[1]]
## [1] NA
## 
## 
## $media_t.co
## $media_t.co[[1]]
## [1] NA
## 
## 
## $media_expanded_url
## $media_expanded_url[[1]]
## [1] NA
## 
## 
## $media_type
## $media_type[[1]]
## [1] NA
## 
## 
## $ext_media_url
## $ext_media_url[[1]]
## [1] NA
## 
## 
## $ext_media_t.co
## $ext_media_t.co[[1]]
## [1] NA
## 
## 
## $ext_media_expanded_url
## $ext_media_expanded_url[[1]]
## [1] NA
## 
## 
## $ext_media_type
## [1] NA
## 
## $mentions_user_id
## $mentions_user_id[[1]]
## [1] "3230388598"         "732925397814247426"
## 
## 
## $mentions_screen_name
## $mentions_screen_name[[1]]
## [1] "dataandme" "ajmcoqui" 
## 
## 
## $lang
## [1] "en"
## 
## $quoted_status_id
## [1] NA
## 
## $quoted_text
## [1] NA
## 
## $retweet_status_id
## [1] "960600422556880896"
## 
## $retweet_text
## [1] "<U+0001F41C> Check it, @ajmcoqui's \"Debugging in RStudio\" \n<U+0001F4FD> Slides *and* cheat sheet!\nhttps://t.co/rAvKP9iXLa #rstats #rstudioconf https://t.co/T4627GcuXK"
## 
## $place_url
## [1] NA
## 
## $place_name
## [1] NA
## 
## $place_full_name
## [1] NA
## 
## $place_type
## [1] NA
## 
## $country
## [1] NA
## 
## $country_code
## [1] NA
## 
## $geo_coords
## $geo_coords[[1]]
## [1] NA NA
## 
## 
## $coords_coords
## $coords_coords[[1]]
## [1] NA NA
## 
## 
## $bbox_coords
## $bbox_coords[[1]]
## [1] NA NA NA NA NA NA NA NA
# Create a sublist of non-retweets
non_rt <- discard(rstudioconf, "is_retweet")

# Extract the favorite count element of each non_rt sublist
fav_count <- map_dbl(non_rt, "favorite_count")

# Get the median of favorite_count for non_rt
median(fav_count)
## [1] 1
# Keep the RT, extract the user_id, remove the duplicate
rt <- keep(rstudioconf, "is_retweet") %>%
    map("user_id") %>%
    unique()

# Remove the RT, extract the user id, remove the duplicate
non_rt <- discard(rstudioconf, "is_retweet") %>%
    map("user_id") %>% 
    unique()

# Determine the total number of users
union(rt, non_rt) %>% 
    length()
## [1] 1742
# Determine the number of users who has just retweeted
setdiff(rt, non_rt) %>% 
    length()
## [1] 1302
# Prefill mean() with na.rm, and round() with digits = 1
mean_na_rm <- partial(mean, na.rm=TRUE)
round_one <- partial(round, digits=1)

# Compose a rounded_mean function
rounded_mean <- purrr::compose(round_one, mean_na_rm)

# Extract the non retweet  
non_rt <- discard(rstudioconf, "is_retweet")

# Extract "favorite_count", and pass it to rounded_mean()
map_dbl(non_rt, "favorite_count") %>%
    rounded_mean()
## [1] 3.3
# Combine as_vector(), compact(), and flatten()
flatten_to_vector <- purrr::compose(as_vector, compact, flatten)

# Complete the function
extractor <- function(list, what = "mentions_screen_name"){
    map(list, what) %>%
        flatten_to_vector()
}

# Create six_most, with tail(), sort(), and table()
six_most <- purrr::compose(tail, sort, table)

# Run extractor() on rstudioconf
extractor(rstudioconf) %>% 
    six_most()
## .
##    JennyBryan hadleywickham      AmeliaMN    juliasilge          drob 
##           278           308           362           376           418 
##       rstudio 
##           648
# Extract the "urls_url" elements, and flatten() the result
urls_clean <- map(rstudioconf, "urls_url") %>%
    lapply(FUN=function(x) { ifelse(is.na(x), list(NULL), x) }) %>%
    flatten()

# Remove the NULL
compact_urls <- compact(urls_clean)

# Create a mapper that detects the pattern "github"
has_github <- as_mapper(~ str_detect(.x[1], "github"))

# Look for the "github" pattern, and sum the result
map_lgl( compact_urls, has_github ) %>%
    sum()
## [1] 346
# Complete the function
ratio_pattern <- function(vec, pattern){
    n_pattern <- str_detect(vec, pattern) %>%
        sum()
    n_pattern / length(vec)
}

# Create flatten_and_compact()
flatten_and_compact <- purrr::compose(compact, flatten)

# Complete the pipe to get the ratio of URLs with "github"
map(rstudioconf, "urls_url") %>%
    lapply(FUN=function(x) { ifelse(is.na(x), list(NULL), x) }) %>%
    flatten_and_compact() %>% 
    ratio_pattern("github")
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing
## [1] 0.3033217
# Create mean_above, a mapper that tests if .x is over 3.3
mean_above <- as_mapper(~ . > 3.3)

# Prefil map_at() with "retweet_count", mean_above for above, 
# and mean_above negation for below
above <- partial(map_at, .at = "retweet_count", .f = mean_above )
below <- partial(map_at, .at = "retweet_count", .f = negate(mean_above) )

# Map above() and below() on non_rt, keep the "retweet_count"
# ab <- map(non_rt, above) %>% keep("retweet_count")
# bl <- map(non_rt, below) %>% keep("retweet_count")

rtCounts <- sapply(map(non_rt, "retweet_count"), FUN=function(x) { x })
ab <- rtCounts[rtCounts > 3.3]
bl <- rtCounts[rtCounts <= 3.3]

# Compare the size of both elements
length(ab)
## [1] 83
length(bl)
## [1] 1741
# Get the max() of "retweet_count" 
max_rt <- map_dbl(non_rt, "retweet_count") %>% 
    max()

# Prefill map_at() with a mapper testing if .x equals max_rt
# max_rt_calc <- partial(map_at, .at = "retweet_count", .f = ~.x==max_rt )

idxMatch <- which(map(non_rt, "retweet_count") == max_rt)

# Map max_rt_calc on non_rt, keep the retweet_count & flatten
# res <- map(non_rt, max_rt_calc) %>% 
#     keep("retweet_count") %>% 
#     flatten()

# Print the "screen_name" and "text" of the result
res <- non_rt[[idxMatch]]
res$screen_name
## [1] "kearneymw"
res$text
## [1] "The week of #rstudioconf is a good time to remind everyone that some important books are [intentionally] available online for free:\nhttps://t.co/ePMiKs3MAr\nhttps://t.co/NHR7wmLGgd\nhttps://t.co/wbymwjG0CD\nhttps://t.co/uwqG0q967M\nhttps://t.co/AjXTfZgyAg\nhttps://t.co/zgoHq51PGV"

Longitudinal Analysis in R

Chapter 1 - Introduction to Longitudinal Data

Introduction to Longitudinal Data:

  • Longitudinal data are data with 3+ measurements on the same unit (individual)
    • Blood pressure every week for 6 weeks
    • Math scores for a student in grades 3-8
    • Extracurricular yes/no for each semester of high school
  • Dichotomous outcomes take one of two values (e.g., TRUE/FALSE), while continuous outcomes can take many values
    • Multiple measurements allow for time-series modeling
    • Two measurements allow for ANCOVA or t-tests
  • Example longitudinal data for rat weights from nlme
    • library(nlme)
    • head(BodyWeight, n = 10)
    • count(BodyWeight, Rat)
    • count(BodyWeight, Time)
    • count(BodyWeight, Diet)

Data Restructuring and Correlations:

  • Data are often stored in wide format, with each individual being a row
  • Analysis in R is generally best in long format, with time as one of the columns
    • tidyr::gather() function for wide to long
    • tidyr::spread() function for long to wide
  • Long to wide transforms can be helpful for calculating correlations
    • BodyWeight %>% mutate(Time = paste0('Time_', Time)) %>% spread(Time, weight) %>% select(Rat, Diet, Time_1, Time_8, everything())
  • Wide to long transforms are more common
    • gather(BodyWeight_wide, key = Time, value = weight, Time_1:Time_64) # the colon operator : can be very helpful for these transforms
  • Can explore correlations over time and how the dependencies change over time
  • The corrr package can help with exploring correlations
    • correlate(): to compute correlation matrix
    • shave(): to remove extra information from matrix
    • fashion(): to format correlation matrix
    • BodyWeight %>% mutate(Time = paste0('Time_', Time)) %>% spread(Time, weight) %>% select(Time_1, Time_8, Time_15:Time_64) %>% correlate() %>% shave(upper = FALSE) %>% fashion(decimals = 3)
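A small round-trip sketch of the wide/long transforms above, using toy data rather than BodyWeight (the ids, times, and weights are made up for illustration):

```r
library(tidyr)
library(dplyr)

# Toy long-format data: one row per individual per time point
long <- tibble(
    id     = rep(1:2, each = 3),
    time   = rep(paste0("Time_", 1:3), times = 2),
    weight = c(240, 250, 255, 225, 230, 240)
)

# Long -> wide: one row per individual, one column per time point
wide <- spread(long, key = time, value = weight)

# Wide -> long: the colon selection Time_1:Time_3 works because spread()
# created the time columns as a contiguous block
long_again <- gather(wide, key = "time", value = "weight", Time_1:Time_3)

nrow(wide)        # one row per id
nrow(long_again)  # back to one row per id per time point
```

The wide form is the shape needed for `corrr::correlate()`; the long form is the shape needed for `group_by()` summaries and `ggplot2`.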

Descriptive Statistics:

  • Numeric summaries are often the most useful when broken down by time or other factors of interest
  • The group_by() and summarize() functions can be very helpful
    • BodyWeight %>% group_by(Time) %>% summarize(mean_wgt = mean(weight, na.rm = TRUE), med_wgt = median(weight, na.rm = TRUE), min_wgt = min(weight, na.rm = TRUE), max_wgt = max(weight, na.rm = TRUE), sd_wgt = sd(weight, na.rm = TRUE), num_miss = sum(is.na(weight)), n = n())
  • Violin plots can be useful for assessing distributions at each point in time
    • ggplot(BodyWeight, aes(x = factor(Time), y = weight)) + geom_violin(aes(fill = Diet)) + xlab("Time (in days)") + ylab("Weight") + theme_bw(base_size = 16)

Example code includes:

data(calcium, package="lava")
str(calcium)
## 'data.frame':    501 obs. of  6 variables:
##  $ bmd   : num  0.815 0.875 0.911 0.952 0.97 0.813 0.833 0.855 0.881 0.901 ...
##  $ group : Factor w/ 2 levels "C","P": 1 1 1 1 1 2 2 2 2 2 ...
##  $ person: int  101 101 101 101 101 102 102 102 102 102 ...
##  $ visit : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ age   : num  10.9 11.4 11.9 12.4 12.9 ...
##  $ ctime : int  11078 11266 11436 11625 11807 11078 11266 11427 11616 11791 ...
# Individuals with data at each visit number
count(calcium, visit)
## # A tibble: 5 x 2
##   visit     n
##   <int> <int>
## 1     1   112
## 2     2   105
## 3     3    99
## 4     4    94
## 5     5    91
# Observations for each individual
count(calcium, person)
## # A tibble: 112 x 2
##    person     n
##     <int> <int>
##  1    101     5
##  2    102     5
##  3    103     5
##  4    104     5
##  5    105     5
##  6    106     5
##  7    107     5
##  8    108     2
##  9    109     5
## 10    110     5
## # ... with 102 more rows
# Observations in each group
count(calcium, group)
## # A tibble: 2 x 2
##   group     n
##   <fct> <int>
## 1 C       245
## 2 P       256
# Individuals with each visit number in each group
count(calcium, visit, group)
## # A tibble: 10 x 3
##    visit group     n
##    <int> <fct> <int>
##  1     1 C        55
##  2     1 P        57
##  3     2 C        52
##  4     2 P        53
##  5     3 C        48
##  6     3 P        51
##  7     4 C        46
##  8     4 P        48
##  9     5 C        44
## 10     5 P        47
# Restructure data into wide format for correlations
calcium_wide <- calcium %>%
    mutate(visit_char = paste0('visit_', visit)) %>%
    select(bmd, person, visit_char) %>%
    spread(visit_char, bmd)

# Calculate correlations across time
calcium_corr <- calcium_wide %>%
    select(-person) %>%
    corrr::correlate(method="pearson") %>%
    corrr::shave(upper=FALSE) %>%
    corrr::fashion(decimals=3)
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
# Convert data from wide to long format
calcium_wide %>% 
    gather(key="visit", value="bmd", -person)
##     person   visit   bmd
## 1      101 visit_1 0.815
## 2      102 visit_1 0.813
## 3      103 visit_1 0.812
## 4      104 visit_1 0.804
## 5      105 visit_1 0.904
## 6      106 visit_1 0.831
## 7      107 visit_1 0.777
## 8      108 visit_1 0.792
## 9      109 visit_1 0.821
## 10     110 visit_1 0.823
##  ... [remaining 550 rows of the gathered output omitted for brevity] ...
# Calculate descriptive statistics
calcium %>%
    group_by(visit, group) %>%
    summarize(avg_bmd = mean(bmd, na.rm = TRUE), median_bmd = median(bmd, na.rm = TRUE),
              minimum_bmd = min(bmd, na.rm = TRUE), maximum_bmd = max(bmd, na.rm = TRUE),
              standev_bmd = sd(bmd, na.rm = TRUE), num_miss = sum(is.na(bmd)), n = n()
              )
## # A tibble: 10 x 9
## # Groups:   visit [5]
##    visit group avg_bmd median_bmd minimum_bmd maximum_bmd standev_bmd
##    <int> <fct>   <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
##  1     1 C       0.880      0.879       0.762        1.03      0.0597
##  2     1 P       0.870      0.861       0.746        1.01      0.0658
##  3     2 C       0.903      0.905       0.769        1.05      0.0593
##  4     2 P       0.890      0.869       0.748        1.06      0.0751
##  5     3 C       0.938      0.938       0.816        1.08      0.0588
##  6     3 P       0.914      0.904       0.756        1.07      0.0780
##  7     4 C       0.964      0.955       0.83         1.11      0.0651
##  8     4 P       0.942      0.925       0.809        1.10      0.0753
##  9     5 C       0.988      0.986       0.849        1.13      0.0629
## 10     5 P       0.958      0.95        0.816        1.12      0.0736
## # ... with 2 more variables: num_miss <int>, n <int>
# Visualize distributions of outcome over time
ggplot(calcium, aes(x = factor(visit), y = bmd)) + 
    geom_violin(aes(fill=group)) + 
    xlab("Visit Number") + 
    ylab("Bone Mineral Density") + 
    theme_bw(base_size = 16)


Chapter 2 - Modeling Continuous Longitudinal Outcomes

Longitudinal Analysis for Continuous Outcomes:

  • Dependencies are introduced based on repeated observations of the same individual
    • library(nlme)
    • ggplot(BodyWeight, aes(x = Time, y = weight)) + geom_line(aes(group = Rat), alpha = 0.6) + geom_smooth(se = FALSE, size = 2) + theme_bw(base_size = 16) + xlab("Number of Days") + ylab("Weight (grams)") # the standard error band is suppressed because the repeated measures violate its independence assumption
  • lmer() stands for "Linear Mixed Effects Regression" - aka Hierarchical Linear Model, Linear Mixed Models, Multi-level Models, Growth Models
    • lmer(outcome ~ fixed_effects + (random_effects | individual), data = data)
    • outcome ~ fixed_effects + (random_effects | individual) # outcome is the variable to be explained, fixed_effects are like a simple regression, random_effects represent deviations per individual, and individual is the ID variable identifying each individual
  • Example of model fitting with the rat data
    • BodyWeight <- mutate(BodyWeight, Time = Time - 1)
    • body_ri <- lmer(weight ~ 1 + Time + (1 | Rat), data = BodyWeight)
    • summary(body_ri)
  • The fixed effects represent the average effects across all the rats

Addition of Random Slope Terms:

  • Random slopes are commonly used along with random intercepts - each individual can have their own random slopes
    • weight ~ 1 + Time + (1 + Time | Rat) # fixed stays the same (though terms may be added)
    • BodyWeight <- mutate(BodyWeight, Time = Time - 1)
    • body_rs <- lmer(weight ~ 1 + Time + (1 + Time | Rat), data = BodyWeight)
  • Sometimes a question of whether random intercepts or random slopes fit the data better - can use anova()
    • Akaike Information Criterion (AIC): smaller is better (recommended when the true model is unknown)
    • Bayesian Information Criterion (BIC): smaller is better (imposes larger penalties for adding predictors)
    • Log likelihood: larger is better (estimation minimizes the deviance, i.e., −2 × log-likelihood)
  • Nested models can be compared using anova() - subset or simplification of another model
    • body_ri <- lmer(weight ~ 1 + Time + (1 | Rat), data = BodyWeight)
    • body_rs <- lmer(weight ~ 1 + Time + (1 + Time | Rat), data = BodyWeight)
    • anova(body_rs, body_ri)

Visualize and Interpret Output:

  • Visualizing model results is an important component of presenting and interpreting the data
    • body_agg <- BodyWeight %>% mutate(pred_values = predict(body_ri))
    • ggplot(body_agg, aes(x = Time, y = pred_values)) + geom_line(aes(group = Rat), alpha = 0.6) + theme_bw(base_size = 16) + xlab("Number of Days") + ylab("Model Implied Values")
  • Random effects help control for dependencies due to repeated measurements
    • Custom function (next slide) will help explore model implied correlations
  • Example of creating and applying a custom function for correlations (“compound symmetry”)
    • corr_structure <- function(object, num_timepoints, intercept_only = TRUE) {
    • variance <- VarCorr(object)
    • if(intercept_only) {
    •   random_matrix <- as.matrix(object@pp$X[1:num_timepoints, 1])  
    •   var_cor <- random_matrix %*% variance[[1]][1] %*% t(random_matrix) + diag(attr(variance, "sc")^2, nrow = num_timepoints, ncol = num_timepoints)  
    • } else {
    • random_matrix <- as.matrix(object@pp$X[1:num_timepoints, ])
    • var_cor <- random_matrix %*% variance[[1]][1:2, 1:2] %*% t(random_matrix) + diag(attr(variance, "sc")^2, nrow = num_timepoints, ncol = num_timepoints)
    • }
    • Matrix::cov2cor(var_cor)
    • }
    • body_ri <- lmer(weight ~ 1 + Time + (1 | Rat), data = BodyWeight)
    • corr_structure(body_ri, 11) %>% round(3)
  • Can visually show correlation dependencies
    • GGally::ggcorr(data = NULL, cor_matrix = corr_structure(body_ri, 11), label = TRUE, label_round = 3, label_size = 3.5, palette = 'Set2', nbreaks = 5)

Example code includes:

# Visualize trajectories
ggplot(calcium, aes(x = visit, y = bmd)) +
    geom_line(aes(group = person), alpha = .4) +
    geom_smooth(se = FALSE, size = 2) +
    theme_bw(base_size = 14) +
    xlab('Visit Number') +
    ylab('Bone Mineral Density (g/cm^2)')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Unconditional model
uncond_model <- lme4::lmer(bmd ~ 1 + visit + (1 | person), data = calcium)

# Show model output
summary(uncond_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit + (1 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2232.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7257 -0.5815  0.0097  0.5980  3.0246 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  person   (Intercept) 0.0044907 0.06701 
##  Residual             0.0002475 0.01573 
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 0.8503372  0.0065326   130.2
## visit       0.0247851  0.0005131    48.3
## 
## Correlation of Fixed Effects:
##       (Intr)
## visit -0.215
# Alter the visit variable to start at 0
calcium <- calcium %>%
    mutate(visit_0 = visit - 1)

# Fit random intercept model with new time variable
uncond_model_0 <- lme4::lmer(bmd ~ 1 + visit_0 + (1 | person), data = calcium)
summary(uncond_model_0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit_0 + (1 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2232.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7257 -0.5815  0.0097  0.5980  3.0246 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  person   (Intercept) 0.0044907 0.06701 
##  Residual             0.0002475 0.01573 
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 0.8751223  0.0064416   135.9
## visit_0     0.0247851  0.0005131    48.3
## 
## Correlation of Fixed Effects:
##         (Intr)
## visit_0 -0.139
# Random slope
uncond_model_rs <- lme4::lmer(bmd ~ 1 + visit_0 + (1 + visit_0 | person), data = calcium)
summary(uncond_model_rs)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit_0 + (1 + visit_0 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2351.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2961 -0.5334 -0.0050  0.5035  2.2933 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  person   (Intercept) 4.152e-03 0.064438     
##           visit_0     5.239e-05 0.007238 0.14
##  Residual             1.247e-04 0.011167     
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 0.8752292  0.0061471   142.4
## visit_0     0.0245982  0.0008146    30.2
## 
## Correlation of Fixed Effects:
##         (Intr)
## visit_0 0.066
# Compare random slopes and random intercept only models
anova(uncond_model_rs, uncond_model_0)
## refitting model(s) with ML (instead of REML)
## Data: calcium
## Models:
## uncond_model_0: bmd ~ 1 + visit_0 + (1 | person)
## uncond_model_rs: bmd ~ 1 + visit_0 + (1 + visit_0 | person)
##                 Df     AIC     BIC logLik deviance  Chisq Chi Df
## uncond_model_0   4 -2246.0 -2229.2 1127.0  -2254.0              
## uncond_model_rs  6 -2359.8 -2334.5 1185.9  -2371.8 117.79      2
##                 Pr(>Chisq)    
## uncond_model_0                
## uncond_model_rs  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create predicted values for random intercept only model
calcium_vis <- calcium %>%
    mutate(pred_values_ri = predict(uncond_model_0))
  
# Visualize random intercepts
ggplot(calcium_vis, aes(x = visit_0, y = pred_values_ri)) + 
    geom_line(size = 1, color = 'gray70', aes(group = person)) + 
    theme_bw() + 
    xlab("Visit Number") +
    ylab("Model Predicted Bone Mineral Density (g/cm^2)")

# Create predicted values for random intercept and slope model
calcium_vis <- calcium %>%
    mutate(pred_values_rs = predict(uncond_model_rs))

# Visualize random intercepts and slopes
ggplot(calcium_vis, aes(x = visit_0, y = pred_values_rs)) + 
    geom_line(size = 1, color = 'gray70', aes(group = person)) + 
    theme_bw() + 
    xlab("Visit Number") +
    ylab("Model Predicted Bone Mineral Density (g/cm^2)")

corr_structure <- function(object, num_timepoints, intercept_only = TRUE) {
    variance <- lme4::VarCorr(object)
    if(intercept_only) {
        random_matrix <- as.matrix(object@pp$X[1:num_timepoints, 1])
        var_cor <- random_matrix %*% variance[[1]][1] %*% t(random_matrix) + diag(attr(variance, "sc")^2, nrow = num_timepoints, ncol = num_timepoints)
    } else {
        random_matrix <- as.matrix(object@pp$X[1:num_timepoints, ])
        var_cor <- random_matrix %*% variance[[1]][1:2, 1:2] %*% t(random_matrix) + diag(attr(variance, "sc")^2, nrow = num_timepoints, ncol = num_timepoints)
    }
    Matrix::cov2cor(var_cor)
}

# Random intercept and slope model
random_slope <- lme4::lmer(bmd ~ 1 + visit_0 + (1 + visit_0 | person), data = calcium)

# Generate model implied correlation matrix
mod_corr <- corr_structure(random_slope, num_timepoints = 5, intercept_only = FALSE)
round(mod_corr, 3)
##       1     2     3     4     5
## 1 1.000 0.966 0.950 0.927 0.899
## 2 0.966 1.000 0.968 0.955 0.935
## 3 0.950 0.968 1.000 0.970 0.959
## 4 0.927 0.955 0.970 1.000 0.973
## 5 0.899 0.935 0.959 0.973 1.000
# Create visualization for correlation structure
GGally::ggcorr(data = NULL, cor_matrix = mod_corr, midpoint = NULL, 
               limits = NULL, label = TRUE, label_round = 3, label_size = 5, 
               nbreaks = 100, palette = 'PuBuGn'
               )
## Color gradient midpoint set at median correlation to 0.96


Chapter 3 - Add Fixed Predictor Variables

Adding Predictors:

  • Predictors build up the fixed effects to help explain the outcome, reducing the unexplained variance of the random intercepts/slopes
    • BodyWeight <- BodyWeight %>% mutate(Time = Time - 1, diet_f = paste("Diet", Diet, sep = " "))
    • body_weight <- lmer(weight ~ 1 + Time + diet_f + (1 + Time | Rat), data = BodyWeight)
  • Visualizing predictors can be especially helpful
    • BodyWeight <- BodyWeight %>% mutate(Time = Time - 1, diet_f = paste("Diet", Diet, sep = " "))
    • body_weight <- lmer(weight ~ 1 + Time + diet_f + (1 + Time | Rat), data = BodyWeight)
    • bodyweight_agg <- BodyWeight %>% mutate(pred_values = predict(body_weight, re.form = NA)) %>% group_by(Time, Diet) %>% summarize(mean_diet_pred = mean(pred_values))
    • ggplot(bodyweight_agg, aes(x = Time, y = mean_diet_pred, color = Diet)) + geom_point(data = BodyWeight, aes(x = Time, y = weight)) + geom_line(size = 2) + ylab("Body Weight") + xlab("Time (in days)") + theme_bw(base_size = 16)

Adding Predictors - Interactions:

  • Can add predictor variables to the model
    • BodyWeight <- BodyWeight %>% mutate(Time = Time - 1, diet_f = paste("Diet", Diet, sep = " "))
    • body_weight <- lmer(weight ~ 1 + Time + diet_f + diet_f:Time + (1 + Time | Rat), data = BodyWeight)
  • Need to check assumptions to ensure that model results are trustworthy
    • Ensure model results are trustworthy
    • Explore the distribution of residuals and random effects
    • body_weight <- lmer(weight ~ 1 + Time + diet_f + diet_f:Time + (1 + Time | Rat), data = BodyWeight)
    • BodyWeight <- BodyWeight %>% mutate(model_residuals = residuals(body_weight))
    • ggplot(BodyWeight, aes(x = model_residuals)) + geom_density(aes(color = diet_f), size = 1.25) + xlab("Residuals") + theme_bw(base_size = 14) + scale_color_brewer(palette = "Set2")
  • Can use the ranef() function to extract random effects from the model
    • body_weight <- lmer(weight ~ 1 + Time + diet_f + diet_f:Time + (1 + Time | Rat), data = BodyWeight)
    • random_effects <- ranef(body_weight)$Rat %>% mutate(id = 1:n()) %>% gather("variable", "value", -id)
  • Can also run Q-Q plots to assess normality
    • ggplot(random_effects, aes(sample = value)) + geom_qq() + geom_qq_line() + facet_wrap(~variable, scales = 'free_y') + theme_bw(base_size = 14)

Model Comparisons and Explained Variance:

  • Model comparisons may be made using AIC, AICc (corrected AIC), or BIC
    • AICc converges to AIC with large samples
  • Example of model comparisons
    • BodyWeight <- BodyWeight %>% mutate(Time = Time - 1, diet_f = paste("Diet", Diet, sep = " "))
    • body_weight_rs <- lmer(weight ~ 1 + Time + (1 + Time | Rat), data = BodyWeight, REML = FALSE) # REML=TRUE would be for only changes in random effects; REML=FALSE is needed when changing fixed effects (random effects should stay the same)
    • body_weight_diet <- lmer(weight ~ 1 + Time + diet_f + (1 + Time | Rat), data = BodyWeight, REML = FALSE)
    • body_weight_diet_int <- lmer(weight ~ 1 + Time + diet_f + diet_f:Time + (1 + Time | Rat), data = BodyWeight, REML = FALSE)
  • Can calculate AICc from AICcmodavg
    • AICcmodavg::aictab(list(body_weight_rs, body_weight_diet, body_weight_diet_int), modnames = c('random slope', 'diet intercept', 'diet interaction'))
  • Explained variance can also be helpful in model assessment - larger values are better
    • MuMIn::r.squaredGLMM(body_weight_rs)
    • MuMIn::r.squaredGLMM(body_weight_diet)
    • MuMIn::r.squaredGLMM(body_weight_diet_int)
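The relationship between AIC and AICc can be sketched directly; `aicc` below is an illustrative helper (not part of AICcmodavg), assuming k estimated parameters and n observations:

```r
# AICc adds a small-sample correction term to AIC:
#   AICc = AIC + 2k(k + 1) / (n - k - 1)
# As n grows, the correction shrinks toward zero, so AICc converges to AIC.
aicc <- function(aic, k, n) {
    aic + (2 * k * (k + 1)) / (n - k - 1)
}

aicc(100, k = 5, n = 20)   # noticeable correction with a small sample
aicc(100, k = 5, n = 2000) # nearly identical to AIC with a large sample
```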

Example code includes:

# Add a categorical predictor
bmd_group <- lme4::lmer(bmd ~ 1 + visit_0 + group + (1 + visit_0 | person), data = calcium)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00541016
## (tol = 0.002, component 1)
summary(bmd_group)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit_0 + group + (1 + visit_0 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2344.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.30713 -0.53193 -0.00697  0.50756  2.29630 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  person   (Intercept) 4.158e-03 0.064481     
##           visit_0     5.233e-05 0.007234 0.11
##  Residual             1.247e-04 0.011169     
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  0.8803219  0.0087733 100.341
## visit_0      0.0245961  0.0008144  30.201
## groupP      -0.0100098  0.0122917  -0.814
## 
## Correlation of Fixed Effects:
##         (Intr) vist_0
## visit_0  0.031       
## groupP  -0.713  0.001
## convergence code: 0
## Model failed to converge with max|grad| = 0.00541016 (tol = 0.002, component 1)
# Add a continuous predictor
bmd_group_age <- lme4::lmer(bmd ~ 1 + visit_0 + group + age + (1 + visit_0 | person), data = calcium)
summary(bmd_group_age)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit_0 + group + age + (1 + visit_0 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2343.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.30240 -0.49803 -0.01591  0.51495  2.38624 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  person   (Intercept) 0.0042073 0.06486      
##           visit_0     0.0000490 0.00700  0.12
##  Residual             0.0001244 0.01115      
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  0.507361   0.153645   3.302
## visit_0      0.007942   0.006901   1.151
## groupP      -0.009896   0.012361  -0.801
## age          0.033701   0.013864   2.431
## 
## Correlation of Fixed Effects:
##         (Intr) vist_0 groupP
## visit_0  0.992              
## groupP  -0.039  0.002       
## age     -0.998 -0.993 -0.002
# Calculate aggregate trends
calcium_agg <- calcium %>%
    mutate(pred_values = predict(bmd_group_age, re.form = NA)) %>%
    group_by(visit_0, group) %>%
    summarize(pred_group = mean(pred_values))

# Visualize the model results
ggplot(calcium_agg, aes(x = visit_0, y = pred_group, color = group)) +
    geom_point(data = calcium, aes(x = visit_0, y = bmd, color = group)) +
    geom_line(size = 1.25) +
    xlab('Visit Number') +
    ylab('Model Predicted Bone Mineral Density (g/cm^2)')

# Add an interaction
bmd_group_age_int  <- lme4::lmer(bmd ~ 1 + visit_0 + age + group + visit_0:group + visit_0:age + (1 + visit_0 | person), data = calcium)
summary(bmd_group_age_int)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bmd ~ 1 + visit_0 + age + group + visit_0:group + visit_0:age +  
##     (1 + visit_0 | person)
##    Data: calcium
## 
## REML criterion at convergence: -2331.6
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.22416 -0.52727 -0.01329  0.52812  2.29995 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr
##  person   (Intercept) 4.219e-03 0.064953     
##           visit_0     4.465e-05 0.006682 0.12
##  Residual             1.231e-04 0.011096     
## Number of obs: 501, groups:  person, 112
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)     0.4582089  0.1540324   2.975
## visit_0         0.0235327  0.0097236   2.420
## age             0.0381262  0.0138911   2.745
## groupP         -0.0116086  0.0123912  -0.937
## visit_0:groupP -0.0044210  0.0015302  -2.889
## visit_0:age    -0.0011939  0.0006104  -1.956
## 
## Correlation of Fixed Effects:
##             (Intr) vist_0 age    groupP vs_0:P
## visit_0      0.579                            
## age         -0.998 -0.582                     
## groupP      -0.039 -0.003 -0.002              
## vist_0:grpP -0.018 -0.088  0.017  0.049       
## visit_0:age  0.157 -0.706 -0.154  0.000 -0.005
# Add residuals original data
calcium <- calcium %>%
    mutate(model_residuals = residuals(bmd_group_age_int))

# Visualize residuals
ggplot(calcium, aes(x = model_residuals)) + 
    geom_density(aes(color = group), size = 1.25) + 
    theme_bw(base_size = 14) + 
    xlab("Model Residuals")

# Extract random effects
random_effects <- lme4::ranef(bmd_group_age_int)$person %>%
    mutate(id = 1:n()) %>%
    gather("variable", "value", -id)

# Visualize random effects
ggplot(random_effects, aes(sample = value)) + 
    geom_qq() + 
    geom_qq_line() + 
    facet_wrap(~variable, scales = 'free_y') + 
    theme_bw(base_size = 14)

# Compare random slope, model with group variable, model with group and age, and model with interactions.
AICcmodavg::aictab(list(uncond_model_rs, bmd_group, bmd_group_age, bmd_group_age_int),
                   modnames = c('random slope', 'group intercept', 'group and age', 
                                'group and age interaction'
                                )
                   )
## Warning in aictab.AIClmerMod(list(uncond_model_rs, bmd_group, bmd_group_age, : 
## Model selection for fixed effects is only appropriate with ML estimation:
## REML (default) should only be used to select random effects for a constant set of fixed effects
## 
## Model selection based on AICc:
## 
##                            K     AICc Delta_AICc AICcWt Cum.Wt  Res.LL
## random slope               6 -2338.91       0.00   0.98   0.98 1175.54
## group intercept            7 -2330.51       8.40   0.01   1.00 1172.37
## group and age              8 -2327.48      11.42   0.00   1.00 1171.89
## group and age interaction 10 -2311.18      27.73   0.00   1.00 1165.81
# Compute explained variance for random slope only model
MuMIn::r.squaredGLMM(uncond_model_rs)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help
## page.
##            R2m       R2c
## [1,] 0.2016978 0.9793256
# Compute explained variance for group and age predicting intercepts model
MuMIn::r.squaredGLMM(bmd_group_age)
##           R2m       R2c
## [1,] 0.206134 0.9794822
# Compute explained variance for interaction model
MuMIn::r.squaredGLMM(bmd_group_age_int)
##            R2m       R2c
## [1,] 0.2177605 0.9799048

Chapter 4 - Modeling Longitudinal Dichotomous Outcomes

Exploring and Modeling Dichotomous Outcomes:

  • Binary outcomes include yes/no, presence/absence, etc.
    • library(HSAUR2)
    • head(toenail, n = 10)
  • Generalized Linear Mixed Models (GLMM) explore the log-odds of success
    • toenail <- toenail %>% mutate(outcome_dich = ifelse(outcome == "none or mild", 1, 0), visit_0 = visit - 1)
    • toenail %>% group_by(visit_0) %>% summarise(prop_outcome = mean(outcome_dich), num = n())
  • Two modifications to lmer to run GLMM
    • use glmer instead of lmer
    • specify family = binomial argument
    • toe_output <- glmer(outcome_dich ~ 1 + visit_0 + treatment + ( 1 | patientID), data = toenail, family = binomial)
    • summary(toe_output)
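Because GLMM fixed effects are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch, assuming a fitted glmer model named toe_output as above:

```r
# Fixed effects are log-odds; exp() converts them to odds ratios
odds_ratios <- exp(lme4::fixef(toe_output))
odds_ratios

# Approximate 95% Wald confidence intervals on the odds-ratio scale
# (parm = "beta_" restricts the intervals to the fixed effects)
exp(confint(toe_output, method = "Wald", parm = "beta_"))
```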

Generalized Estimating Equations (GEE):

  • GEE is another way to model dichotomous and other non-continuous outcomes
  • The geepack package contains functions for running GEE
    • toenail <- toenail %>% mutate(outcome_dich = ifelse(outcome == "none or mild", 1, 0), visit_0 = visit - 1)
    • gee_toe <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, scale.fix = TRUE)
    • summary(gee_toe)
  • An optional argument, corstr is used to control the working correlation matrix
    • Accounts for the dependency due to repeated measures
    • The default is independence
    • gee_toe <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, corstr = 'exchangeable', scale.fix = TRUE)
    • summary(gee_toe)
  • Additional correlation matrices can be specified
    • corstr = "ar1" # correlation decays as a power of the time lag
    • corstr = "unstructured" # unique correlation for each time lag
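The shapes of these working correlation matrices can be sketched directly; assuming a common correlation rho and three time points, the exchangeable and AR(1) forms look like:

```r
rho <- 0.5
timepoints <- 3

# Exchangeable: the same correlation between every pair of time points
exch <- matrix(rho, nrow = timepoints, ncol = timepoints)
diag(exch) <- 1
exch

# AR(1): correlation decays as a power of the time lag, rho^|i - j|
ar1 <- rho ^ abs(outer(1:timepoints, 1:timepoints, "-"))
ar1
```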

Model Selection:

  • The working correlation matrix can significantly impact study results
  • The QIC statistic is the quasi-likelihood under the independence model criterion - lower (smaller) is better
    • toenail <- toenail %>% mutate(outcome_dich = ifelse(outcome == "none or mild", 1, 0), visit_0 = visit - 1)
    • gee_toe <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, scale.fix = TRUE)
    • MuMIn::QIC(gee_toe)
  • Can evaluate model criterion
    • gee_ind <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, scale.fix = TRUE)
    • gee_exch <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, scale.fix = TRUE, corstr = 'exchangeable')
    • gee_ar1 <- geeglm(outcome_dich ~ 1 + visit_0, data = toenail, id = patientID, family = binomial, scale.fix = TRUE, corstr = 'ar1')
    • MuMIn::QIC(gee_ind, gee_exch, gee_ar1)
  • Can also use the aictab() from AICcmodavg
    • toe_baseline <- glmer(outcome_dich ~ 1 + visit_0 + ( 1 | patientID), data = toenail, family = binomial)
    • toe_output <- glmer(outcome_dich ~ 1 + visit_0 + treatment + ( 1 | patientID), data = toenail, family = binomial)
    • AICcmodavg::aictab(list(toe_baseline, toe_output), c("no treatment", "treatment"))

Interpreting and Visualizing Model Results:

  • Can visualize the GLMM outputs
    • toe_output <- glmer(outcome_dich ~ 1 + visit_0 + treatment + ( 1 | patientID), data = toenail, family = binomial)
    • toenail <- toenail %>% mutate(pred_values = predict(toe_output))
    • ggplot(toenail, aes(x = visit_0, y = pred_values)) + geom_line(aes(group = patientID), linetype = 2) + theme_bw(base_size = 16) + xlab("Visit Number") + ylab("Predicted Values")
  • Probabilities are often more intuitive and can be found using type = "response"
    • toenail <- toenail %>% mutate(pred_values = predict(toe_output, type = "response"))
    • ggplot(toenail, aes(x = visit_0, y = pred_values)) + geom_line(aes(group = patientID), linetype = 2) + theme_bw(base_size = 16) + xlab("Visit Number") + ylab("Prob of none or mild separation")
  • Can use predict() using GEE similar to GLMM
    • gee_toe <- geeglm(outcome_dich ~ 1 + visit_0 + treatment, data = toenail, id = patientID, family = binomial, corstr = 'exchangeable', scale.fix = TRUE)
    • toenail_gee <- toenail %>% mutate(pred_gee = predict(gee_toe, type = "response"))
    • ggplot(toenail_gee, aes(x = visit_0, y = pred_gee)) + geom_line(aes(color = treatment)) + theme_bw(base_size = 16) + xlab("Visit Number") + ylab("Probability of none or mild separation")
  • Compare GLMM and GEE
    • toenail_glmm <- toenail %>% group_by(visit_0, treatment) %>% summarise(prob = mean(pred_values))
    • toenail_gee <- toenail_gee %>% group_by(visit_0, treatment) %>% summarise(prob = mean(pred_gee))
    • toenail_agg <- bind_rows( mutate(toenail_glmm, model = "GLMM"), mutate(toenail_gee, model = "GEE") )
    • ggplot(toenail_agg, aes(x = visit_0, y = prob)) + geom_line(aes(color = treatment, linetype = model), size = 1) + theme_bw(base_size = 16) + xlab("Visit Number") + ylab("Prob of none or mild separation")

Example code includes:

ids <- rep(c(1:82, 85:87, 90), each=12)
months <- rep(0:11, times=length(unique(ids)))
symps <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, NA, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, NA, NA, NA, NA, NA, NA, NA, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)
symps <- c(symps, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
age <- rep(c('Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', '20 or Older', 'Less than 20', 'Less than 20', '20 or Older', 'Less than 20', '20 or Older', 'Less than 20', 'Less than 20', 'Less than 20'), each=12)
sex <- rep(c('Male', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'), each=12)
madras <- data.frame(id=ids, symptom=symps, month=months, age=age, sex=sex)


# Explore the first few rows of the madras data
str(madras)
## 'data.frame':    1032 obs. of  5 variables:
##  $ id     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ symptom: num  1 1 1 1 1 0 0 0 0 0 ...
##  $ month  : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ age    : Factor w/ 2 levels "20 or Older",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
head(madras)
##   id symptom month          age  sex
## 1  1       1     0 Less than 20 Male
## 2  1       1     1 Less than 20 Male
## 3  1       1     2 Less than 20 Male
## 4  1       1     3 Less than 20 Male
## 5  1       1     4 Less than 20 Male
## 6  1       0     5 Less than 20 Male
# Descriptives about symptom prevalence over time
summary_stats <- madras %>%
    group_by(month) %>%
    summarize(num_symptom = sum(symptom, na.rm = TRUE), 
              num = n(),
              prop_symptom = mean(symptom, na.rm = TRUE)
              )
  
# Print out summary statistics
summary_stats
## # A tibble: 12 x 4
##    month num_symptom   num prop_symptom
##    <int>       <dbl> <int>        <dbl>
##  1     0          56    86       0.651 
##  2     1          51    86       0.6   
##  3     2          43    86       0.518 
##  4     3          37    86       0.462 
##  5     4          28    86       0.359 
##  6     5          20    86       0.263 
##  7     6          13    86       0.171 
##  8     7          10    86       0.133 
##  9     8           7    86       0.0972
## 10     9           8    86       0.111 
## 11    10           6    86       0.0857
## 12    11           6    86       0.0870
# Build models
uncond_ri <- lme4::glmer(symptom ~ 1 + month + (1|id), data = madras, family = binomial)
summary(uncond_ri)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: symptom ~ 1 + month + (1 | id)
##    Data: madras
## 
##      AIC      BIC   logLik deviance df.resid 
##    764.0    778.5   -379.0    758.0      919 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.7679 -0.3539 -0.1316  0.3253  6.3572 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 4.961    2.227   
## Number of obs: 922, groups:  id, 86
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.07446    0.30866   3.481 0.000499 ***
## month       -0.54049    0.04451 -12.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## month -0.457
# Add in covariates based on trend plot
cond_model <- lme4::glmer(symptom ~ 1 + month + sex + age + sex:age + sex:month + (1 | id), data = madras, family = binomial)

# Generate summary of output
summary(cond_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: symptom ~ 1 + month + sex + age + sex:age + sex:month + (1 |  
##     id)
##    Data: madras
## 
##      AIC      BIC   logLik deviance df.resid 
##    760.2    794.0   -373.1    746.2      915 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.7862 -0.3770 -0.1273  0.3201  9.4814 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 4.305    2.075   
## Number of obs: 922, groups:  id, 86
## 
## Fixed effects:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              1.60467    0.60890   2.635   0.0084 ** 
## month                   -0.63718    0.07406  -8.604   <2e-16 ***
## sexMale                 -0.55037    0.92853  -0.593   0.5534    
## ageLess than 20         -1.49168    0.75966  -1.964   0.0496 *  
## sexMale:ageLess than 20  1.98705    1.09623   1.813   0.0699 .  
## month:sexMale            0.15234    0.08694   1.752   0.0797 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) month  sexMal agLt20 sM:Lt2
## month       -0.415                            
## sexMale     -0.655  0.270                     
## ageLssthn20 -0.707  0.120  0.462              
## sxMl:gLst20  0.492 -0.093 -0.762 -0.697       
## month:sexMl  0.342 -0.800 -0.352 -0.080  0.057
# Fit a GEE model with intercept and time variable
gee_mod <- geepack::geeglm(symptom ~ 1 + month, id = id, family=binomial, data = madras, scale.fix = TRUE)

# Extract model results
summary(gee_mod)
## 
## Call:
## geepack::geeglm(formula = symptom ~ 1 + month, family = binomial, 
##     data = madras, id = id, scale.fix = TRUE)
## 
##  Coefficients:
##             Estimate  Std.err  Wald Pr(>|W|)    
## (Intercept)  0.67587  0.21077 10.28  0.00134 ** 
## month       -0.32535  0.04452 53.42  2.7e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale is fixed.
## 
## Correlation: Structure = independenceNumber of clusters:   86   Maximum cluster size: 12
# Fit a GEE model with an ar(1) working correlation matrix
gee_mod_ar1 <- geepack::geeglm(symptom ~ 1 + month, id = id, family = binomial, data = madras, corstr="ar1", scale.fix = TRUE)

# Fit a GEE model with an unstructured working correlation matrix
gee_mod_un <- geepack::geeglm(symptom ~ 1 + month, id = id, family = binomial, data = madras, corstr="unstructured", scale.fix = TRUE)

# Extract model results
summary(gee_mod_ar1)
## 
## Call:
## geepack::geeglm(formula = symptom ~ 1 + month, family = binomial, 
##     data = madras, id = id, corstr = "ar1", scale.fix = TRUE)
## 
##  Coefficients:
##             Estimate Std.err Wald Pr(>|W|)    
## (Intercept)    0.634   0.199 10.1   0.0015 ** 
## month         -0.308   0.041 56.6  5.3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale is fixed.
## 
## Correlation: Structure = ar1  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.678  0.0451
## Number of clusters:   86   Maximum cluster size: 12
summary(gee_mod_un)
## 
## Call:
## geepack::geeglm(formula = symptom ~ 1 + month, family = binomial, 
##     data = madras, id = id, corstr = "unstructured", scale.fix = TRUE)
## 
##  Coefficients:
##             Estimate Std.err Wald Pr(>|W|)  
## (Intercept)   -1.607   0.950 2.86    0.091 .
## month          1.008   0.578 3.03    0.082 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale is fixed.
## 
## Correlation: Structure = unstructured  Link = identity 
## 
## Estimated Correlation Parameters:
##             Estimate  Std.err
## alpha.1:2      1.341 1.03e+00
## alpha.1:3      0.403 5.58e-01
## alpha.1:4     -0.497 5.61e-01
## alpha.1:5     -1.882 6.12e-01
## alpha.1:6     -4.397 3.73e+00
## alpha.1:7     -9.542 1.26e+01
## alpha.1:8    -16.387 2.79e+01
## alpha.1:9    -28.267 5.83e+01
## alpha.1:10   -45.536 1.09e+02
## alpha.1:11   -77.518 2.10e+02
## alpha.1:12  -126.572 3.83e+02
## alpha.2:3      0.547 1.52e-01
## alpha.2:4      0.248 4.80e-01
## alpha.2:5     -0.446 6.41e-01
## alpha.2:6     -1.154 6.25e-01
## alpha.2:7     -2.836 1.89e+00
## alpha.2:8     -6.135 7.38e+00
## alpha.2:9    -12.003 1.95e+01
## alpha.2:10   -19.110 3.80e+01
## alpha.2:11   -30.538 7.06e+01
## alpha.2:12   -49.110 1.29e+02
## alpha.3:4      0.962 5.84e-01
## alpha.3:5      1.070 1.44e+00
## alpha.3:6      1.411 2.71e+00
## alpha.3:7      1.397 3.87e+00
## alpha.3:8      2.825 7.08e+00
## alpha.3:9      4.176 1.12e+01
## alpha.3:10     6.223 1.71e+01
## alpha.3:11     8.645 2.56e+01
## alpha.3:12    15.820 4.76e+01
## alpha.4:5      2.990 3.42e+00
## alpha.4:6      4.473 6.70e+00
## alpha.4:7      6.695 1.20e+01
## alpha.4:8     12.036 2.41e+01
## alpha.4:9     19.898 4.44e+01
## alpha.4:10    31.794 7.92e+01
## alpha.4:11    54.121 1.48e+02
## alpha.4:12    91.655 2.75e+02
## alpha.5:6     10.186 1.73e+01
## alpha.5:7     15.764 3.14e+01
## alpha.5:8     27.066 6.09e+01
## alpha.5:9     46.677 1.17e+02
## alpha.5:10    77.419 2.15e+02
## alpha.5:11   131.788 4.02e+02
## alpha.5:12   215.885 7.19e+02
## alpha.6:7     34.951 7.97e+01
## alpha.6:8     57.428 1.47e+02
## alpha.6:9     95.123 2.70e+02
## alpha.6:10   157.530 4.90e+02
## alpha.6:11   262.681 8.91e+02
## alpha.6:12   432.122 1.59e+03
## alpha.7:8    106.445 3.04e+02
## alpha.7:9    170.915 5.38e+02
## alpha.7:10   282.934 9.69e+02
## alpha.7:11   481.633 1.78e+03
## alpha.7:12   793.897 3.17e+03
## alpha.8:9    314.324 1.08e+03
## alpha.8:10   503.088 1.87e+03
## alpha.8:11   841.802 3.38e+03
## alpha.8:12  1388.948 5.97e+03
## alpha.9:10   861.042 3.46e+03
## alpha.9:11  1465.730 6.30e+03
## alpha.9:12  2420.508 1.11e+04
## alpha.10:11 2465.610 1.13e+04
## alpha.10:12 4072.802 1.99e+04
## alpha.11:12 6961.365 3.59e+04
## Number of clusters:   86   Maximum cluster size: 12
# Fit a GEE model with an ar(1) working correlation matrix
gee_mod_ar1 <- geepack::geeglm(symptom ~ 1 + month + age + sex + age:sex + month:age + month:sex, id = id, data = madras, family = binomial, corstr = 'ar1', scale.fix = TRUE)

# Extract model results
summary(gee_mod_ar1)
## 
## Call:
## geepack::geeglm(formula = symptom ~ 1 + month + age + sex + age:sex + 
##     month:age + month:sex, family = binomial, data = madras, 
##     id = id, corstr = "ar1", scale.fix = TRUE)
## 
##  Coefficients:
##                         Estimate Std.err  Wald Pr(>|W|)    
## (Intercept)               1.3287  0.4784  7.71   0.0055 ** 
## month                    -0.4819  0.0792 37.07  1.1e-09 ***
## ageLess than 20          -1.0874  0.5472  3.95   0.0469 *  
## sexMale                  -0.6198  0.6748  0.84   0.3583    
## ageLess than 20:sexMale   1.0846  0.7136  2.31   0.1285    
## month:ageLess than 20     0.0637  0.0895  0.51   0.4766    
## month:sexMale             0.1763  0.0980  3.24   0.0720 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale is fixed.
## 
## Correlation: Structure = ar1  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.624  0.0413
## Number of clusters:   86   Maximum cluster size: 12
# Fit a GEE model with an exchangeable working correlation matrix
gee_mod_exch <- geepack::geeglm(symptom ~ 1 + month + age + sex, data = madras, id = id, family = binomial, scale.fix = TRUE, corstr = 'exchangeable')


glmm_age <- lme4::glmer(symptom ~ 1 + month + age + month:age + (1 | id), data=madras, family=binomial)

# Generate model implied probabilities
madras <- madras %>%
    na.omit() %>%
    mutate(prob=predict(glmm_age, type="response"))

# Visualize subject specific probabilities
ggplot(madras, aes(x=month, y=prob)) + 
    geom_line(aes(group=id)) + 
    theme_bw() +
    xlab("Month") + 
    ylab("Probabilities")

# Compute the average trajectories
glmm_prob <- madras %>%
    group_by(month, age) %>%
    summarize(prob = mean(prob))

# Visualize average trajectories
ggplot(glmm_prob, aes(x = month, y = prob)) + 
    geom_line(aes(color = age)) + 
    theme_bw() +
    xlab("Month") + 
    ylab("Probability")

gee_age_sex <- geepack::geeglm(formula = symptom ~ 1 + month + age + month:age + sex, 
                               family = binomial, data = madras, id = id, corstr = "ar1", scale.fix = TRUE
                               )

# Compute model implied probabilities using gee_age_sex
madras_gee <- madras %>% 
    select(month, symptom, age, sex) %>% 
    na.omit() %>%
    mutate(prob = predict(gee_age_sex, type = "response"))

# Visualize trajectories
ggplot(madras_gee, aes(x = month, y = prob)) + 
    geom_line(aes(color = age)) + 
    facet_wrap(~ sex) + 
    theme_bw() +
    xlab("Month") +
    ylab("Probability")

# Fit a GLMM model
glmm_age_sex <- lme4::glmer(symptom ~ 1 + month + age + sex + (1 | id), data = madras, family = binomial)

# Generate model implied probabilities
madras <- madras %>% mutate(prob_gee = predict(gee_age_sex, type = "response"),
                            prob_glmm = predict(glmm_age_sex, type = "response")
                            )

# Compute average GEE probabilities
madras_gee <- madras %>%
    group_by(month, age, sex) %>%
    summarize(prob = mean(prob_gee))

# Compute average GLMM probabilities
madras_glmm <- madras %>%
    group_by(month, age, sex) %>%
    summarize(prob = mean(prob_glmm))

# Create combined data object
madras_agg <- bind_rows(
    mutate(madras_glmm, model = "GLMM"), 
    mutate(madras_gee, model = "GEE") 
)


# Visualize differences in trajectories across model types
ggplot(madras_agg, aes(x = month, y = prob)) + 
    geom_line(aes(color = sex, linetype = model)) + 
    facet_wrap(~ age) + 
    theme_bw() + 
    xlab("Month") + 
    ylab("Probability")


Data Manipulation in R with data.table

Chapter 1 - Introduction to data.table

Introduction:

  • A data.table is a data.frame with extended capabilities
    • DT[i, j, by] # which rows, what to do (operates on columns), grouped by what
  • data.table is fast and feature-rich, and parallelizes operations internally where possible
  • There are at least three ways to create a data.table
    • data.table()
    • as.data.table()
    • fread()
    • x <- data.table(id = 1:2, name = c("a", "b"))
  • All functions that can be used on data.frame are available on data.table
    • Note that data.table does not convert characters to factors, does not set row names, and uses colon to separate row number from the row data
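The DT[i, j, by] form can be sketched on a toy table (a hypothetical dt, not from the course):

```r
library(data.table)

# Toy table: two groups with numeric values
dt <- data.table(grp = c("a", "a", "b"), val = c(1, 2, 3))

# i filters rows, j computes on columns, by groups the computation
dt[val > 1, .(total = sum(val)), by = grp]
# grp "a" -> total 2, grp "b" -> total 3
```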

Filtering Rows in a data.table:

  • The data.table syntax consists of DT[i, j, by] # which rows, what to do (operates on columns), grouped by what
    • batrips[3:4]
    • batrips[3:4, ] # same as above
    • batrips[-(1:5)] # exclude rows 1:5
    • batrips[!(1:5)] # exclude rows 1:5
  • The data.table has some special symbols that help with calculations
    • .N is an integer holding the number of rows in the current group (the total row count when used in i); a particularly useful alternative to nrow(x)
    • batrips[.N]
    • ans <- batrips[1:(.N-10)]
    • batrips[subscription_type == "Subscriber"] # if this is a data.table
    • batrips[batrips$subscription_type == "Subscriber", ] # required if it is only a data.frame
    • batrips[start_terminal == 58 & end_terminal != 65] # if this is a data.table
  • When a column is used to subset for the first time, data.table automatically creates a secondary index on that column (future subsets on that column will be much faster)
    • dt <- data.table(x = sample(10000, 1e6, TRUE), y = sample(letters, 1e6, TRUE))
    • indices(dt)
    • system.time(dt[x == 900])
    • indices(dt)
    • system.time(dt[x == 900])
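A minimal way to watch the automatic index appear (assuming the default datatable.auto.index option is enabled):

```r
library(data.table)

set.seed(42)
dt <- data.table(x = sample(1e4, 1e6, TRUE),
                 y = sample(letters, 1e6, TRUE))

indices(dt)              # NULL - no index yet
invisible(dt[x == 900])  # first subset on x builds an index as a side effect
indices(dt)              # now reports "x"; later x == ... subsets reuse it
```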

Helpers for filtering:

  • The %like% operator allows for matching a pattern in a column
    • batrips[start_station %like% "^San Francisco"] # for data.table, with the ^ being from regex for "starts with"
    • batrips[grepl("^San Francisco", start_station)] # equivalent for data.frame
  • The %between% operator finds values in the closed interval [a, b] (both endpoints included)
    • batrips[duration %between% c(2000, 3000)] # using data.table
    • batrips[duration >= 2000 & duration <= 3000] # equivalent for data.frame
  • The %chin% operator allows for %in% for character vectors
    • batrips[start_station %chin% c("Japantown", "Mezes Park", "MLK Library")] # for data.table, runs MUCH faster
    • batrips[start_station %in% c("Japantown", "Mezes Park", "MLK Library")] # alternate, slower syntax
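The three helpers side by side on a small hypothetical table (station names and durations made up for illustration):

```r
library(data.table)

dt <- data.table(station  = c("San Jose Diridon", "Japantown", "MLK Library"),
                 duration = c(1500, 2500, 3500))

dt[station %like% "^San"]                         # regex match: the San Jose Diridon row
dt[duration %between% c(2000, 3000)]              # closed interval [2000, 3000]: the Japantown row
dt[station %chin% c("Japantown", "MLK Library")]  # fast %in% for character vectors: two rows
```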

Example code includes:

# Load data.table
library(data.table)

# Create the data.table X 
X <- data.table(id = c("a", "b", "c"), value = c(0.5, 1.0, 1.5))

# View X
X
##    id value
## 1:  a   0.5
## 2:  b   1.0
## 3:  c   1.5
data(batrips, package="bikeshare14")
batrips <- as.data.table(batrips)

# Get number of columns in batrips
col_number <- ncol(batrips)

# Print the first 8 rows
head(batrips, 8)
##    trip_id duration          start_date           start_station
## 1:  139545      435 2014-01-01 00:14:00 San Francisco City Hall
## 2:  139546      432 2014-01-01 00:14:00 San Francisco City Hall
## 3:  139547     1523 2014-01-01 00:17:00  Embarcadero at Sansome
## 4:  139549     1620 2014-01-01 00:23:00       Steuart at Market
## 5:  139550     1617 2014-01-01 00:23:00       Steuart at Market
## 6:  139551      779 2014-01-01 00:24:00       Steuart at Market
## 7:  139552      784 2014-01-01 00:24:00       Steuart at Market
## 8:  139553      721 2014-01-01 00:25:00       Steuart at Market
##    start_terminal            end_date           end_station end_terminal
## 1:             58 2014-01-01 00:21:00       Townsend at 7th           65
## 2:             58 2014-01-01 00:21:00       Townsend at 7th           65
## 3:             60 2014-01-01 00:42:00       Beale at Market           56
## 4:             74 2014-01-01 00:50:00    Powell Street BART           39
## 5:             74 2014-01-01 00:50:00    Powell Street BART           39
## 6:             74 2014-01-01 00:37:00 Washington at Kearney           46
## 7:             74 2014-01-01 00:37:00 Washington at Kearney           46
## 8:             74 2014-01-01 00:37:00 Washington at Kearney           46
##    bike_id subscription_type zip_code
## 1:     473        Subscriber    94612
## 2:     395        Subscriber    94107
## 3:     331        Subscriber    94112
## 4:     605          Customer    92007
## 5:     453          Customer    92007
## 6:     335          Customer    94109
## 7:     580          Customer         
## 8:     563          Customer    94109
# Print the last 8 rows
tail(batrips, 8)
##    trip_id duration          start_date                   start_station
## 1:  588907      770 2014-12-31 22:51:00                 Townsend at 7th
## 2:  588909      992 2014-12-31 23:06:00            Washington at Kearny
## 3:  588908     1004 2014-12-31 23:06:00            Washington at Kearny
## 4:  588910      437 2014-12-31 23:18:00              Powell Street BART
## 5:  588911      422 2014-12-31 23:19:00 Grant Avenue at Columbus Avenue
## 6:  588912     1487 2014-12-31 23:31:00        South Van Ness at Market
## 7:  588913     1458 2014-12-31 23:32:00        South Van Ness at Market
## 8:  588914      364 2014-12-31 23:33:00           Embarcadero at Bryant
##    start_terminal            end_date
## 1:             65 2014-12-31 23:04:00
## 2:             46 2014-12-31 23:23:00
## 3:             46 2014-12-31 23:23:00
## 4:             39 2014-12-31 23:25:00
## 5:             73 2014-12-31 23:26:00
## 6:             66 2014-12-31 23:56:00
## 7:             66 2014-12-31 23:56:00
## 8:             54 2014-12-31 23:40:00
##                                      end_station end_terminal bike_id
## 1:                                 Howard at 2nd           63     677
## 2:                        Embarcadero at Vallejo           48     485
## 3:                        Embarcadero at Vallejo           48     419
## 4:      San Francisco Caltrain (Townsend at 4th)           70     573
## 5: Yerba Buena Center of the Arts (3rd @ Howard)           68     604
## 6:                             Steuart at Market           74     480
## 7:                             Steuart at Market           74     277
## 8:                                 Howard at 2nd           63      56
##    subscription_type zip_code
## 1:        Subscriber    94107
## 2:          Customer    92104
## 3:          Customer    92104
## 4:        Subscriber    95050
## 5:        Subscriber    94133
## 6:          Customer    94109
## 7:          Customer    94109
## 8:        Subscriber    94105
# Print the structure of batrips
str(batrips)
## Classes 'data.table' and 'data.frame':   326339 obs. of  11 variables:
##  $ trip_id          : int  139545 139546 139547 139549 139550 139551 139552 139553 139554 139555 ...
##  $ duration         : int  435 432 1523 1620 1617 779 784 721 624 574 ...
##  $ start_date       : POSIXct, format: "2014-01-01 00:14:00" "2014-01-01 00:14:00" ...
##  $ start_station    : chr  "San Francisco City Hall" "San Francisco City Hall" "Embarcadero at Sansome" "Steuart at Market" ...
##  $ start_terminal   : int  58 58 60 74 74 74 74 74 57 57 ...
##  $ end_date         : POSIXct, format: "2014-01-01 00:21:00" "2014-01-01 00:21:00" ...
##  $ end_station      : chr  "Townsend at 7th" "Townsend at 7th" "Beale at Market" "Powell Street BART" ...
##  $ end_terminal     : int  65 65 56 39 39 46 46 46 68 68 ...
##  $ bike_id          : int  473 395 331 605 453 335 580 563 358 365 ...
##  $ subscription_type: chr  "Subscriber" "Subscriber" "Subscriber" "Customer" ...
##  $ zip_code         : chr  "94612" "94107" "94112" "92007" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Filter third row
row_3 <- batrips[3]
row_3
##    trip_id duration          start_date          start_station
## 1:  139547     1523 2014-01-01 00:17:00 Embarcadero at Sansome
##    start_terminal            end_date     end_station end_terminal bike_id
## 1:             60 2014-01-01 00:42:00 Beale at Market           56     331
##    subscription_type zip_code
## 1:        Subscriber    94112
# Filter rows 10 through 20
rows_10_20 <- batrips[10:20]
rows_10_20
##     trip_id duration          start_date
##  1:  139555      574 2014-01-01 00:25:00
##  2:  139558     1600 2014-01-01 00:28:00
##  3:  139559     3691 2014-01-01 00:32:00
##  4:  139560     3793 2014-01-01 00:32:00
##  5:  139561     3788 2014-01-01 00:32:00
##  6:  139562     3626 2014-01-01 00:33:00
##  7:  139563      805 2014-01-01 00:33:00
##  8:  139564      769 2014-01-01 00:34:00
##  9:  139565     1157 2014-01-01 00:36:00
## 10:  139566     1159 2014-01-01 00:37:00
## 11:  139567      477 2014-01-01 00:39:00
##                            start_station start_terminal
##  1:                        5th at Howard             57
##  2: Harry Bridges Plaza (Ferry Building)             50
##  3:                    Steuart at Market             74
##  4:                    Steuart at Market             74
##  5:                    Steuart at Market             74
##  6:                    Steuart at Market             74
##  7:                    Steuart at Market             74
##  8:                    Steuart at Market             74
##  9:                    Steuart at Market             74
## 10:                    Steuart at Market             74
## 11:                      Beale at Market             56
##                end_date                                   end_station
##  1: 2014-01-01 00:35:00 Yerba Buena Center of the Arts (3rd @ Howard)
##  2: 2014-01-01 00:54:00                             Steuart at Market
##  3: 2014-01-01 01:33:00                             Steuart at Market
##  4: 2014-01-01 01:35:00                             Steuart at Market
##  5: 2014-01-01 01:35:00                             Steuart at Market
##  6: 2014-01-01 01:33:00                             Steuart at Market
##  7: 2014-01-01 00:47:00 Yerba Buena Center of the Arts (3rd @ Howard)
##  8: 2014-01-01 00:47:00 Yerba Buena Center of the Arts (3rd @ Howard)
##  9: 2014-01-01 00:56:00             Civic Center BART (7th at Market)
## 10: 2014-01-01 00:56:00             Civic Center BART (7th at Market)
## 11: 2014-01-01 00:47:00                               Spear at Folsom
##     end_terminal bike_id subscription_type zip_code
##  1:           68     365          Customer    94941
##  2:           74     413        Subscriber    94102
##  3:           74     619          Customer    94070
##  4:           74     311          Customer    55417
##  5:           74     577          Customer    55417
##  6:           74     271          Customer    94070
##  7:           68     508          Customer         
##  8:           68     384          Customer         
##  9:           72     276          Customer         
## 10:           72     423          Customer         
## 11:           49     265          Customer    94105
# Filter the 1st, 6th and 10th rows
rows_1_6_10 <- batrips[c(1, 6, 10)]
rows_1_6_10
##    trip_id duration          start_date           start_station
## 1:  139545      435 2014-01-01 00:14:00 San Francisco City Hall
## 2:  139551      779 2014-01-01 00:24:00       Steuart at Market
## 3:  139555      574 2014-01-01 00:25:00           5th at Howard
##    start_terminal            end_date
## 1:             58 2014-01-01 00:21:00
## 2:             74 2014-01-01 00:37:00
## 3:             57 2014-01-01 00:35:00
##                                      end_station end_terminal bike_id
## 1:                               Townsend at 7th           65     473
## 2:                         Washington at Kearney           46     335
## 3: Yerba Buena Center of the Arts (3rd @ Howard)           68     365
##    subscription_type zip_code
## 1:        Subscriber    94612
## 2:          Customer    94109
## 3:          Customer    94941
# Select all rows except the first two
not_first_two <- batrips[-c(1:2)]
not_first_two
##         trip_id duration          start_date
##      1:  139547     1523 2014-01-01 00:17:00
##      2:  139549     1620 2014-01-01 00:23:00
##      3:  139550     1617 2014-01-01 00:23:00
##      4:  139551      779 2014-01-01 00:24:00
##      5:  139552      784 2014-01-01 00:24:00
##     ---                                     
## 326333:  588910      437 2014-12-31 23:18:00
## 326334:  588911      422 2014-12-31 23:19:00
## 326335:  588912     1487 2014-12-31 23:31:00
## 326336:  588913     1458 2014-12-31 23:32:00
## 326337:  588914      364 2014-12-31 23:33:00
##                           start_station start_terminal            end_date
##      1:          Embarcadero at Sansome             60 2014-01-01 00:42:00
##      2:               Steuart at Market             74 2014-01-01 00:50:00
##      3:               Steuart at Market             74 2014-01-01 00:50:00
##      4:               Steuart at Market             74 2014-01-01 00:37:00
##      5:               Steuart at Market             74 2014-01-01 00:37:00
##     ---                                                                   
## 326333:              Powell Street BART             39 2014-12-31 23:25:00
## 326334: Grant Avenue at Columbus Avenue             73 2014-12-31 23:26:00
## 326335:        South Van Ness at Market             66 2014-12-31 23:56:00
## 326336:        South Van Ness at Market             66 2014-12-31 23:56:00
## 326337:           Embarcadero at Bryant             54 2014-12-31 23:40:00
##                                           end_station end_terminal bike_id
##      1:                               Beale at Market           56     331
##      2:                            Powell Street BART           39     605
##      3:                            Powell Street BART           39     453
##      4:                         Washington at Kearney           46     335
##      5:                         Washington at Kearney           46     580
##     ---                                                                   
## 326333:      San Francisco Caltrain (Townsend at 4th)           70     573
## 326334: Yerba Buena Center of the Arts (3rd @ Howard)           68     604
## 326335:                             Steuart at Market           74     480
## 326336:                             Steuart at Market           74     277
## 326337:                                 Howard at 2nd           63      56
##         subscription_type zip_code
##      1:        Subscriber    94112
##      2:          Customer    92007
##      3:          Customer    92007
##      4:          Customer    94109
##      5:          Customer         
##     ---                           
## 326333:        Subscriber    95050
## 326334:        Subscriber    94133
## 326335:          Customer    94109
## 326336:          Customer    94109
## 326337:        Subscriber    94105
# Select all rows except 1 through 5 and 10 through 15
exclude_some <- batrips[-c(1:5, 10:15)]
exclude_some
##         trip_id duration          start_date
##      1:  139551      779 2014-01-01 00:24:00
##      2:  139552      784 2014-01-01 00:24:00
##      3:  139553      721 2014-01-01 00:25:00
##      4:  139554      624 2014-01-01 00:25:00
##      5:  139563      805 2014-01-01 00:33:00
##     ---                                     
## 326324:  588910      437 2014-12-31 23:18:00
## 326325:  588911      422 2014-12-31 23:19:00
## 326326:  588912     1487 2014-12-31 23:31:00
## 326327:  588913     1458 2014-12-31 23:32:00
## 326328:  588914      364 2014-12-31 23:33:00
##                           start_station start_terminal            end_date
##      1:               Steuart at Market             74 2014-01-01 00:37:00
##      2:               Steuart at Market             74 2014-01-01 00:37:00
##      3:               Steuart at Market             74 2014-01-01 00:37:00
##      4:                   5th at Howard             57 2014-01-01 00:35:00
##      5:               Steuart at Market             74 2014-01-01 00:47:00
##     ---                                                                   
## 326324:              Powell Street BART             39 2014-12-31 23:25:00
## 326325: Grant Avenue at Columbus Avenue             73 2014-12-31 23:26:00
## 326326:        South Van Ness at Market             66 2014-12-31 23:56:00
## 326327:        South Van Ness at Market             66 2014-12-31 23:56:00
## 326328:           Embarcadero at Bryant             54 2014-12-31 23:40:00
##                                           end_station end_terminal bike_id
##      1:                         Washington at Kearney           46     335
##      2:                         Washington at Kearney           46     580
##      3:                         Washington at Kearney           46     563
##      4: Yerba Buena Center of the Arts (3rd @ Howard)           68     358
##      5: Yerba Buena Center of the Arts (3rd @ Howard)           68     508
##     ---                                                                   
## 326324:      San Francisco Caltrain (Townsend at 4th)           70     573
## 326325: Yerba Buena Center of the Arts (3rd @ Howard)           68     604
## 326326:                             Steuart at Market           74     480
## 326327:                             Steuart at Market           74     277
## 326328:                                 Howard at 2nd           63      56
##         subscription_type zip_code
##      1:          Customer    94109
##      2:          Customer         
##      3:          Customer    94109
##      4:          Customer    94941
##      5:          Customer         
##     ---                           
## 326324:        Subscriber    95050
## 326325:        Subscriber    94133
## 326326:          Customer    94109
## 326327:          Customer    94109
## 326328:        Subscriber    94105
# Select all rows except the first and last
not_first_last <- batrips[-c(1, .N)]
not_first_last
##         trip_id duration          start_date
##      1:  139546      432 2014-01-01 00:14:00
##      2:  139547     1523 2014-01-01 00:17:00
##      3:  139549     1620 2014-01-01 00:23:00
##      4:  139550     1617 2014-01-01 00:23:00
##      5:  139551      779 2014-01-01 00:24:00
##     ---                                     
## 326333:  588908     1004 2014-12-31 23:06:00
## 326334:  588910      437 2014-12-31 23:18:00
## 326335:  588911      422 2014-12-31 23:19:00
## 326336:  588912     1487 2014-12-31 23:31:00
## 326337:  588913     1458 2014-12-31 23:32:00
##                           start_station start_terminal            end_date
##      1:         San Francisco City Hall             58 2014-01-01 00:21:00
##      2:          Embarcadero at Sansome             60 2014-01-01 00:42:00
##      3:               Steuart at Market             74 2014-01-01 00:50:00
##      4:               Steuart at Market             74 2014-01-01 00:50:00
##      5:               Steuart at Market             74 2014-01-01 00:37:00
##     ---                                                                   
## 326333:            Washington at Kearny             46 2014-12-31 23:23:00
## 326334:              Powell Street BART             39 2014-12-31 23:25:00
## 326335: Grant Avenue at Columbus Avenue             73 2014-12-31 23:26:00
## 326336:        South Van Ness at Market             66 2014-12-31 23:56:00
## 326337:        South Van Ness at Market             66 2014-12-31 23:56:00
##                                           end_station end_terminal bike_id
##      1:                               Townsend at 7th           65     395
##      2:                               Beale at Market           56     331
##      3:                            Powell Street BART           39     605
##      4:                            Powell Street BART           39     453
##      5:                         Washington at Kearney           46     335
##     ---                                                                   
## 326333:                        Embarcadero at Vallejo           48     419
## 326334:      San Francisco Caltrain (Townsend at 4th)           70     573
## 326335: Yerba Buena Center of the Arts (3rd @ Howard)           68     604
## 326336:                             Steuart at Market           74     480
## 326337:                             Steuart at Market           74     277
##         subscription_type zip_code
##      1:        Subscriber    94107
##      2:        Subscriber    94112
##      3:          Customer    92007
##      4:          Customer    92007
##      5:          Customer    94109
##     ---                           
## 326333:          Customer    92104
## 326334:        Subscriber    95050
## 326335:        Subscriber    94133
## 326336:          Customer    94109
## 326337:          Customer    94109
# Filter all rows where start_station is "MLK Library"
trips_mlk <- batrips[start_station == "MLK Library"]
trips_mlk
##      trip_id duration          start_date start_station start_terminal
##   1:  140683      426 2014-01-02 16:07:00   MLK Library             11
##   2:  140787      333 2014-01-02 16:51:00   MLK Library             11
##   3:  141476      467 2014-01-03 10:08:00   MLK Library             11
##   4:  141641      246 2014-01-03 12:32:00   MLK Library             11
##   5:  141830      380 2014-01-03 15:56:00   MLK Library             11
##  ---                                                                  
## 831:  588367    34719 2014-12-31 01:08:00   MLK Library             11
## 832:  588366    34715 2014-12-31 01:08:00   MLK Library             11
## 833:  588368    34595 2014-12-31 01:10:00   MLK Library             11
## 834:  588429      692 2014-12-31 07:29:00   MLK Library             11
## 835:  588430      599 2014-12-31 07:30:00   MLK Library             11
##                 end_date                       end_station end_terminal
##   1: 2014-01-02 16:14:00 San Jose Diridon Caltrain Station            2
##   2: 2014-01-02 16:56:00                  Adobe on Almaden            5
##   3: 2014-01-03 10:15:00              Paseo de San Antonio            7
##   4: 2014-01-03 12:37:00               San Salvador at 1st            8
##   5: 2014-01-03 16:03:00 San Jose Diridon Caltrain Station            2
##  ---                                                                   
## 831: 2014-12-31 10:47:00                       MLK Library           11
## 832: 2014-12-31 10:47:00                       MLK Library           11
## 833: 2014-12-31 10:47:00                       MLK Library           11
## 834: 2014-12-31 07:40:00 San Jose Diridon Caltrain Station            2
## 835: 2014-12-31 07:40:00 San Jose Diridon Caltrain Station            2
##      bike_id subscription_type zip_code
##   1:     657        Subscriber    94043
##   2:     140        Subscriber    94536
##   3:     182        Subscriber    95035
##   4:     665        Subscriber    95112
##   5:     176        Subscriber    94043
##  ---                                   
## 831:     128          Customer    95112
## 832:     667          Customer    95112
## 833:     241          Customer    95076
## 834:     180        Subscriber    95112
## 835:     682        Subscriber    95113
# Filter all rows where start_station is "MLK Library" AND duration > 1600
trips_mlk_1600 <- batrips[start_station == "MLK Library" & duration > 1600]
trips_mlk_1600
##     trip_id duration          start_date start_station start_terminal
##  1:  147733     1744 2014-01-09 11:47:00   MLK Library             11
##  2:  158900    61848 2014-01-19 16:42:00   MLK Library             11
##  3:  159736     1665 2014-01-20 18:38:00   MLK Library             11
##  4:  159737     1671 2014-01-20 18:38:00   MLK Library             11
##  5:  166370     6125 2014-01-26 14:12:00   MLK Library             11
##  6:  166371     6039 2014-01-26 14:13:00   MLK Library             11
##  7:  171845     4629 2014-01-31 01:07:00   MLK Library             11
##  8:  176588     1781 2014-02-05 09:03:00   MLK Library             11
##  9:  180917    68778 2014-02-11 15:16:00   MLK Library             11
## 10:  207319     1615 2014-03-09 17:04:00   MLK Library             11
## 11:  207641     4490 2014-03-10 08:06:00   MLK Library             11
## 12:  215277     1652 2014-03-15 15:36:00   MLK Library             11
## 13:  215829    26719 2014-03-16 14:09:00   MLK Library             11
## 14:  215835    26606 2014-03-16 14:11:00   MLK Library             11
## 15:  224012     3553 2014-03-22 19:16:00   MLK Library             11
## 16:  224013     3204 2014-03-22 19:22:00   MLK Library             11
## 17:  224304     9506 2014-03-23 14:37:00   MLK Library             11
## 18:  224305     9362 2014-03-23 14:40:00   MLK Library             11
## 19:  231184     2336 2014-03-30 08:48:00   MLK Library             11
## 20:  231461     1839 2014-03-30 15:34:00   MLK Library             11
## 21:  233561     5145 2014-04-02 08:01:00   MLK Library             11
## 22:  237605     6406 2014-04-05 14:23:00   MLK Library             11
## 23:  238354     7041 2014-04-06 18:26:00   MLK Library             11
## 24:  238964     2474 2014-04-07 10:50:00   MLK Library             11
## 25:  243981    22350 2014-04-10 16:34:00   MLK Library             11
## 26:  244441    38525 2014-04-11 00:45:00   MLK Library             11
## 27:  245171     1727 2014-04-11 15:07:00   MLK Library             11
## 28:  245173     1682 2014-04-11 15:08:00   MLK Library             11
## 29:  245208     2245 2014-04-11 15:39:00   MLK Library             11
## 30:  245212     2149 2014-04-11 15:40:00   MLK Library             11
## 31:  245977    25692 2014-04-12 13:47:00   MLK Library             11
## 32:  258870     6557 2014-04-23 16:43:00   MLK Library             11
## 33:  262804     3582 2014-04-28 08:17:00   MLK Library             11
## 34:  271145     1631 2014-05-04 15:19:00   MLK Library             11
## 35:  275500   168395 2014-05-07 15:47:00   MLK Library             11
## 36:  275926     5390 2014-05-07 18:45:00   MLK Library             11
## 37:  275928     5332 2014-05-07 18:46:00   MLK Library             11
## 38:  276189     1705 2014-05-08 06:57:00   MLK Library             11
## 39:  278974    11620 2014-05-10 09:47:00   MLK Library             11
## 40:  286203     2105 2014-05-15 19:43:00   MLK Library             11
## 41:  287783    18184 2014-05-16 20:07:00   MLK Library             11
## 42:  302706     8035 2014-05-29 12:23:00   MLK Library             11
## 43:  302712     7860 2014-05-29 12:25:00   MLK Library             11
## 44:  305126     4874 2014-05-31 11:20:00   MLK Library             11
## 45:  305130     9338 2014-05-31 11:22:00   MLK Library             11
## 46:  315821    15235 2014-06-08 21:22:00   MLK Library             11
## 47:  316695     4403 2014-06-09 13:23:00   MLK Library             11
## 48:  316694     4398 2014-06-09 13:23:00   MLK Library             11
## 49:  324444     1723 2014-06-14 16:21:00   MLK Library             11
## 50:  324446     1700 2014-06-14 16:22:00   MLK Library             11
## 51:  325172     2964 2014-06-15 19:53:00   MLK Library             11
## 52:  325174     2796 2014-06-15 19:56:00   MLK Library             11
## 53:  326021     2477 2014-06-16 12:35:00   MLK Library             11
## 54:  329525     6966 2014-06-18 13:34:00   MLK Library             11
## 55:  331197    20671 2014-06-19 14:30:00   MLK Library             11
## 56:  331253    16797 2014-06-19 15:35:00   MLK Library             11
## 57:  331932     1913 2014-06-19 21:16:00   MLK Library             11
## 58:  336447     2626 2014-06-24 07:33:00   MLK Library             11
## 59:  339011    19511 2014-06-25 14:06:00   MLK Library             11
## 60:  339013    15690 2014-06-25 14:07:00   MLK Library             11
## 61:  345918     2579 2014-06-30 20:46:00   MLK Library             11
## 62:  346850     7419 2014-07-01 15:34:00   MLK Library             11
## 63:  351202    75126 2014-07-04 23:06:00   MLK Library             11
## 64:  370599    14793 2014-07-18 20:16:00   MLK Library             11
## 65:  391024     7450 2014-08-01 19:21:00   MLK Library             11
## 66:  391894     9001 2014-08-03 12:04:00   MLK Library             11
## 67:  407384     1920 2014-08-13 16:18:00   MLK Library             11
## 68:  407382     1928 2014-08-13 16:18:00   MLK Library             11
## 69:  410332     8303 2014-08-15 11:20:00   MLK Library             11
## 70:  410333     8210 2014-08-15 11:22:00   MLK Library             11
## 71:  410336     8066 2014-08-15 11:24:00   MLK Library             11
## 72:  411884     2431 2014-08-16 23:26:00   MLK Library             11
## 73:  412323    10611 2014-08-17 18:14:00   MLK Library             11
## 74:  427235    79197 2014-08-27 13:31:00   MLK Library             11
## 75:  428990    12149 2014-08-28 11:43:00   MLK Library             11
## 76:  428989    12148 2014-08-28 11:43:00   MLK Library             11
## 77:  461834    19167 2014-09-19 20:13:00   MLK Library             11
## 78:  461993    12590 2014-09-20 10:13:00   MLK Library             11
## 79:  468000     1711 2014-09-24 16:34:00   MLK Library             11
## 80:  469674     2859 2014-09-25 16:42:00   MLK Library             11
## 81:  476380     3250 2014-09-30 16:34:00   MLK Library             11
## 82:  495778     3015 2014-10-13 11:16:00   MLK Library             11
## 83:  505517     9848 2014-10-19 14:56:00   MLK Library             11
## 84:  506277     1812 2014-10-20 09:16:00   MLK Library             11
## 85:  509993     1985 2014-10-22 09:37:00   MLK Library             11
## 86:  514887     3298 2014-10-24 21:06:00   MLK Library             11
## 87:  534363     2229 2014-11-07 09:55:00   MLK Library             11
## 88:  569828     1691 2014-12-08 13:47:00   MLK Library             11
## 89:  588367    34719 2014-12-31 01:08:00   MLK Library             11
## 90:  588366    34715 2014-12-31 01:08:00   MLK Library             11
## 91:  588368    34595 2014-12-31 01:10:00   MLK Library             11
##     trip_id duration          start_date start_station start_terminal
##                end_date                       end_station end_terminal
##  1: 2014-01-09 12:16:00                San Jose City Hall           10
##  2: 2014-01-20 09:52:00             San Jose Civic Center            3
##  3: 2014-01-20 19:06:00                       MLK Library           11
##  4: 2014-01-20 19:06:00                       MLK Library           11
##  5: 2014-01-26 15:54:00                       MLK Library           11
##  6: 2014-01-26 15:54:00                       MLK Library           11
##  7: 2014-01-31 02:24:00        SJSU - San Salvador at 9th           16
##  8: 2014-02-05 09:33:00              Paseo de San Antonio            7
##  9: 2014-02-12 10:23:00                       MLK Library           11
## 10: 2014-03-09 17:31:00                       MLK Library           11
## 11: 2014-03-10 09:21:00              Paseo de San Antonio            7
## 12: 2014-03-15 16:04:00             San Jose Civic Center            3
## 13: 2014-03-16 21:35:00        SJSU - San Salvador at 9th           16
## 14: 2014-03-16 21:35:00        SJSU - San Salvador at 9th           16
## 15: 2014-03-22 20:15:00                  Adobe on Almaden            5
## 16: 2014-03-22 20:15:00                  Adobe on Almaden            5
## 17: 2014-03-23 17:16:00                  Adobe on Almaden            5
## 18: 2014-03-23 17:16:00                  Adobe on Almaden            5
## 19: 2014-03-30 09:27:00                     St James Park           13
## 20: 2014-03-30 16:04:00                       MLK Library           11
## 21: 2014-04-02 09:27:00              Paseo de San Antonio            7
## 22: 2014-04-05 16:10:00            SJSU 4th at San Carlos           12
## 23: 2014-04-06 20:23:00                       MLK Library           11
## 24: 2014-04-07 11:31:00 San Jose Diridon Caltrain Station            2
## 25: 2014-04-10 22:46:00                       MLK Library           11
## 26: 2014-04-11 11:27:00        SJSU - San Salvador at 9th           16
## 27: 2014-04-11 15:36:00                       MLK Library           11
## 28: 2014-04-11 15:36:00                       MLK Library           11
## 29: 2014-04-11 16:16:00             San Jose Civic Center            3
## 30: 2014-04-11 16:16:00             San Jose Civic Center            3
## 31: 2014-04-12 20:55:00                       MLK Library           11
## 32: 2014-04-23 18:32:00   Santa Clara County Civic Center           80
## 33: 2014-04-28 09:17:00              Paseo de San Antonio            7
## 34: 2014-05-04 15:46:00             San Jose Civic Center            3
## 35: 2014-05-09 14:34:00                San Jose City Hall           10
## 36: 2014-05-07 20:15:00                       MLK Library           11
## 37: 2014-05-07 20:15:00                       MLK Library           11
## 38: 2014-05-08 07:26:00                San Jose City Hall           10
## 39: 2014-05-10 13:00:00                       MLK Library           11
## 40: 2014-05-15 20:18:00                  San Pedro Square            6
## 41: 2014-05-17 01:10:00                       MLK Library           11
## 42: 2014-05-29 14:37:00                       MLK Library           11
## 43: 2014-05-29 14:36:00                       MLK Library           11
## 44: 2014-05-31 12:41:00            SJSU 4th at San Carlos           12
## 45: 2014-05-31 13:58:00            SJSU 4th at San Carlos           12
## 46: 2014-06-09 01:36:00                  Adobe on Almaden            5
## 47: 2014-06-09 14:36:00        SJSU - San Salvador at 9th           16
## 48: 2014-06-09 14:36:00        SJSU - San Salvador at 9th           16
## 49: 2014-06-14 16:50:00 San Jose Diridon Caltrain Station            2
## 50: 2014-06-14 16:50:00 San Jose Diridon Caltrain Station            2
## 51: 2014-06-15 20:42:00                       MLK Library           11
## 52: 2014-06-15 20:43:00                       MLK Library           11
## 53: 2014-06-16 13:16:00                San Jose City Hall           10
## 54: 2014-06-18 15:30:00                       MLK Library           11
## 55: 2014-06-19 20:15:00        SJSU - San Salvador at 9th           16
## 56: 2014-06-19 20:15:00        SJSU - San Salvador at 9th           16
## 57: 2014-06-19 21:48:00        SJSU - San Salvador at 9th           16
## 58: 2014-06-24 08:16:00                       MLK Library           11
## 59: 2014-06-25 19:31:00             San Jose Civic Center            3
## 60: 2014-06-25 18:28:00             San Jose Civic Center            3
## 61: 2014-06-30 21:29:00               San Salvador at 1st            8
## 62: 2014-07-01 17:38:00 San Jose Diridon Caltrain Station            2
## 63: 2014-07-05 19:58:00                       MLK Library           11
## 64: 2014-07-19 00:22:00            SJSU 4th at San Carlos           12
## 65: 2014-08-01 21:25:00 San Jose Diridon Caltrain Station            2
## 66: 2014-08-03 14:34:00                       MLK Library           11
## 67: 2014-08-13 16:50:00                San Jose City Hall           10
## 68: 2014-08-13 16:50:00                San Jose City Hall           10
## 69: 2014-08-15 13:39:00                       MLK Library           11
## 70: 2014-08-15 13:39:00                       MLK Library           11
## 71: 2014-08-15 13:39:00                       MLK Library           11
## 72: 2014-08-17 00:07:00                       MLK Library           11
## 73: 2014-08-17 21:11:00              Paseo de San Antonio            7
## 74: 2014-08-28 11:31:00                       MLK Library           11
## 75: 2014-08-28 15:06:00                       MLK Library           11
## 76: 2014-08-28 15:06:00                       MLK Library           11
## 77: 2014-09-20 01:33:00                       MLK Library           11
## 78: 2014-09-20 13:43:00              Paseo de San Antonio            7
## 79: 2014-09-24 17:02:00                       MLK Library           11
## 80: 2014-09-25 17:29:00                San Jose City Hall           10
## 81: 2014-09-30 17:28:00                San Jose City Hall           10
## 82: 2014-10-13 12:07:00                         Japantown            9
## 83: 2014-10-19 17:40:00                       MLK Library           11
## 84: 2014-10-20 09:46:00                         Japantown            9
## 85: 2014-10-22 10:10:00                         Japantown            9
## 86: 2014-10-24 22:01:00        SJSU - San Salvador at 9th           16
## 87: 2014-11-07 10:32:00                         Japantown            9
## 88: 2014-12-08 14:15:00                         Japantown            9
## 89: 2014-12-31 10:47:00                       MLK Library           11
## 90: 2014-12-31 10:47:00                       MLK Library           11
## 91: 2014-12-31 10:47:00                       MLK Library           11
##                end_date                       end_station end_terminal
##     bike_id subscription_type zip_code
##  1:     691        Subscriber    95112
##  2:      86          Customer    95608
##  3:     175          Customer    95166
##  4:     262          Customer    95166
##  5:      11          Customer    95117
##  6:     308          Customer    95117
##  7:     307          Customer    94510
##  8:     249        Subscriber    95126
##  9:     140        Subscriber    95112
## 10:     133          Customer    95112
## 11:     133        Subscriber    95126
## 12:      59          Customer    92646
## 13:     690          Customer    90278
## 14:     666          Customer    95112
## 15:     132          Customer    95122
## 16:     176          Customer    95122
## 17:     140          Customer     <NA>
## 18:     668          Customer    95112
## 19:     175          Customer    60613
## 20:     199          Customer     <NA>
## 21:     154        Subscriber    95126
## 22:      62        Subscriber    95126
## 23:     158          Customer    95112
## 24:     199        Subscriber    95112
## 25:     168          Customer     9539
## 26:      10          Customer    95131
## 27:     112          Customer     <NA>
## 28:     187          Customer     <NA>
## 29:     176          Customer     <NA>
## 30:     112          Customer     <NA>
## 31:     247          Customer    95112
## 32:     716          Customer    95054
## 33:     130        Subscriber    95126
## 34:     132          Customer    95123
## 35:      39          Customer     6473
## 36:     164          Customer     <NA>
## 37:     109          Customer     <NA>
## 38:     257        Subscriber    94041
## 39:     109          Customer    95125
## 40:     658          Customer    94089
## 41:     691          Customer   195127
## 42:     227          Customer    95112
## 43:      15          Customer    95112
## 44:     122          Customer    95476
## 45:     227          Customer    95476
## 46:     711          Customer    80526
## 47:      30          Customer    95116
## 48:     227          Customer    95116
## 49:     150          Customer    95112
## 50:     164          Customer    95112
## 51:     715          Customer    95123
## 52:     646          Customer    95123
## 53:     709        Subscriber    94010
## 54:      45          Customer     3102
## 55:     258          Customer    94704
## 56:     646          Customer    94708
## 57:     646          Customer    94708
## 58:      96          Customer    28801
## 59:      62          Customer     <NA>
## 60:     711          Customer    95113
## 61:     128          Customer    95133
## 62:     188        Subscriber    94002
## 63:      81          Customer     <NA>
## 64:     699          Customer    95037
## 65:      81          Customer    94606
## 66:      41          Customer    95112
## 67:     255          Customer    95110
## 68:     648          Customer    95110
## 69:     186          Customer    95139
## 70:     223          Customer    95139
## 71:     256          Customer    95139
## 72:      23        Subscriber    95112
## 73:     678          Customer     <NA>
## 74:     128        Subscriber    95112
## 75:      11          Customer    90621
## 76:      93          Customer    90621
## 77:     294          Customer    95116
## 78:     298          Customer    92105
## 79:     305          Customer    95112
## 80:     197          Customer    95112
## 81:     161          Customer    95112
## 82:     205        Subscriber    95122
## 83:      30          Customer   889503
## 84:     107        Subscriber    95122
## 85:     301        Subscriber    95122
## 86:     706        Subscriber    95122
## 87:      51        Subscriber    95122
## 88:     253        Subscriber    95122
## 89:     128          Customer    95112
## 90:     667          Customer    95112
## 91:     241          Customer    95076
##     bike_id subscription_type zip_code
# Filter all rows where `subscription_type` is not `"Subscriber"`
customers <- batrips[subscription_type != "Subscriber"]
customers
##        trip_id duration          start_date            start_station
##     1:  139549     1620 2014-01-01 00:23:00        Steuart at Market
##     2:  139550     1617 2014-01-01 00:23:00        Steuart at Market
##     3:  139551      779 2014-01-01 00:24:00        Steuart at Market
##     4:  139552      784 2014-01-01 00:24:00        Steuart at Market
##     5:  139553      721 2014-01-01 00:25:00        Steuart at Market
##    ---                                                              
## 48572:  588900     1428 2014-12-31 22:06:00    Embarcadero at Bryant
## 48573:  588909      992 2014-12-31 23:06:00     Washington at Kearny
## 48574:  588908     1004 2014-12-31 23:06:00     Washington at Kearny
## 48575:  588912     1487 2014-12-31 23:31:00 South Van Ness at Market
## 48576:  588913     1458 2014-12-31 23:32:00 South Van Ness at Market
##        start_terminal            end_date            end_station
##     1:             74 2014-01-01 00:50:00     Powell Street BART
##     2:             74 2014-01-01 00:50:00     Powell Street BART
##     3:             74 2014-01-01 00:37:00  Washington at Kearney
##     4:             74 2014-01-01 00:37:00  Washington at Kearney
##     5:             74 2014-01-01 00:37:00  Washington at Kearney
##    ---                                                          
## 48572:             54 2014-12-31 22:30:00  Embarcadero at Folsom
## 48573:             46 2014-12-31 23:23:00 Embarcadero at Vallejo
## 48574:             46 2014-12-31 23:23:00 Embarcadero at Vallejo
## 48575:             66 2014-12-31 23:56:00      Steuart at Market
## 48576:             66 2014-12-31 23:56:00      Steuart at Market
##        end_terminal bike_id subscription_type zip_code
##     1:           39     605          Customer    92007
##     2:           39     453          Customer    92007
##     3:           46     335          Customer    94109
##     4:           46     580          Customer         
##     5:           46     563          Customer    94109
##    ---                                                
## 48572:           51     502          Customer    94587
## 48573:           48     485          Customer    92104
## 48574:           48     419          Customer    92104
## 48575:           74     480          Customer    94109
## 48576:           74     277          Customer    94109
# Filter all rows where start_station is "Ryland Park" AND subscription_type is not "Customer"
ryland_park_subscribers <- batrips[start_station == "Ryland Park" & subscription_type != "Customer"]
ryland_park_subscribers
##      trip_id duration          start_date start_station start_terminal
##   1:  243456      330 2014-04-10 09:10:00   Ryland Park             84
##   2:  244497      594 2014-04-11 07:28:00   Ryland Park             84
##   3:  245067      265 2014-04-11 13:38:00   Ryland Park             84
##   4:  246212      522 2014-04-12 17:25:00   Ryland Park             84
##   5:  246933      493 2014-04-14 07:30:00   Ryland Park             84
##  ---                                                                  
## 936:  587254      493 2014-12-29 13:27:00   Ryland Park             84
## 937:  587827      619 2014-12-30 08:35:00   Ryland Park             84
## 938:  588063      489 2014-12-30 13:29:00   Ryland Park             84
## 939:  588442      423 2014-12-31 07:48:00   Ryland Park             84
## 940:  588447      536 2014-12-31 07:56:00   Ryland Park             84
##                 end_date                       end_station end_terminal
##   1: 2014-04-10 09:16:00                         Japantown            9
##   2: 2014-04-11 07:38:00 San Jose Diridon Caltrain Station            2
##   3: 2014-04-11 13:43:00                  San Pedro Square            6
##   4: 2014-04-12 17:34:00               San Salvador at 1st            8
##   5: 2014-04-14 07:38:00 San Jose Diridon Caltrain Station            2
##  ---                                                                   
## 936: 2014-12-29 13:36:00   Santa Clara County Civic Center           80
## 937: 2014-12-30 08:45:00   Santa Clara County Civic Center           80
## 938: 2014-12-30 13:37:00   Santa Clara County Civic Center           80
## 939: 2014-12-31 07:55:00             San Jose Civic Center            3
## 940: 2014-12-31 08:05:00   Santa Clara County Civic Center           80
##      bike_id subscription_type zip_code
##   1:      23        Subscriber    95110
##   2:      54        Subscriber    95110
##   3:      62        Subscriber    95110
##   4:     144        Subscriber    95110
##   5:     643        Subscriber    95110
##  ---                                   
## 936:     126        Subscriber    95112
## 937:     126        Subscriber    95112
## 938:     126        Subscriber    95112
## 939:     710        Subscriber     5112
## 940:      75        Subscriber    95112
# Filter all rows where end_station contains "Market"
any_markets <- batrips[end_station %like% "Market"]
any_markets
##        trip_id duration          start_date
##     1:  139547     1523 2014-01-01 00:17:00
##     2:  139558     1600 2014-01-01 00:28:00
##     3:  139559     3691 2014-01-01 00:32:00
##     4:  139560     3793 2014-01-01 00:32:00
##     5:  139561     3788 2014-01-01 00:32:00
##    ---                                     
## 69120:  588882     1714 2014-12-31 19:14:00
## 69121:  588888      726 2014-12-31 20:05:00
## 69122:  588901      531 2014-12-31 22:08:00
## 69123:  588912     1487 2014-12-31 23:31:00
## 69124:  588913     1458 2014-12-31 23:32:00
##                               start_station start_terminal
##     1:               Embarcadero at Sansome             60
##     2: Harry Bridges Plaza (Ferry Building)             50
##     3:                    Steuart at Market             74
##     4:                    Steuart at Market             74
##     5:                    Steuart at Market             74
##    ---                                                    
## 69120:               Embarcadero at Sansome             60
## 69121:                   Powell Street BART             39
## 69122:               Embarcadero at Sansome             60
## 69123:             South Van Ness at Market             66
## 69124:             South Van Ness at Market             66
##                   end_date                       end_station end_terminal
##     1: 2014-01-01 00:42:00                   Beale at Market           56
##     2: 2014-01-01 00:54:00                 Steuart at Market           74
##     3: 2014-01-01 01:33:00                 Steuart at Market           74
##     4: 2014-01-01 01:35:00                 Steuart at Market           74
##     5: 2014-01-01 01:35:00                 Steuart at Market           74
##    ---                                                                   
## 69120: 2014-12-31 19:43:00                     Market at 4th           76
## 69121: 2014-12-31 20:17:00 Civic Center BART (7th at Market)           72
## 69122: 2014-12-31 22:17:00                 Steuart at Market           74
## 69123: 2014-12-31 23:56:00                 Steuart at Market           74
## 69124: 2014-12-31 23:56:00                 Steuart at Market           74
##        bike_id subscription_type zip_code
##     1:     331        Subscriber    94112
##     2:     413        Subscriber    94102
##     3:     619          Customer    94070
##     4:     311          Customer    55417
##     5:     577          Customer    55417
##    ---                                   
## 69120:     328          Customer     <NA>
## 69121:     475        Subscriber    94112
## 69122:     360        Subscriber    94107
## 69123:     480          Customer    94109
## 69124:     277          Customer    94109
# Filter all rows where end_station ends with "Market" 
end_markets <- batrips[end_station %like% "Market$"]
end_markets
##        trip_id duration          start_date
##     1:  139547     1523 2014-01-01 00:17:00
##     2:  139558     1600 2014-01-01 00:28:00
##     3:  139559     3691 2014-01-01 00:32:00
##     4:  139560     3793 2014-01-01 00:32:00
##     5:  139561     3788 2014-01-01 00:32:00
##    ---                                     
## 23352:  588841      308 2014-12-31 17:27:00
## 23353:  588855      229 2014-12-31 17:51:00
## 23354:  588901      531 2014-12-31 22:08:00
## 23355:  588912     1487 2014-12-31 23:31:00
## 23356:  588913     1458 2014-12-31 23:32:00
##                               start_station start_terminal
##     1:               Embarcadero at Sansome             60
##     2: Harry Bridges Plaza (Ferry Building)             50
##     3:                    Steuart at Market             74
##     4:                    Steuart at Market             74
##     5:                    Steuart at Market             74
##    ---                                                    
## 23352:               Embarcadero at Sansome             60
## 23353:            Broadway St at Battery St             82
## 23354:               Embarcadero at Sansome             60
## 23355:             South Van Ness at Market             66
## 23356:             South Van Ness at Market             66
##                   end_date       end_station end_terminal bike_id
##     1: 2014-01-01 00:42:00   Beale at Market           56     331
##     2: 2014-01-01 00:54:00 Steuart at Market           74     413
##     3: 2014-01-01 01:33:00 Steuart at Market           74     619
##     4: 2014-01-01 01:35:00 Steuart at Market           74     311
##     5: 2014-01-01 01:35:00 Steuart at Market           74     577
##    ---                                                           
## 23352: 2014-12-31 17:32:00 Steuart at Market           74     395
## 23353: 2014-12-31 17:55:00   Beale at Market           56      86
## 23354: 2014-12-31 22:17:00 Steuart at Market           74     360
## 23355: 2014-12-31 23:56:00 Steuart at Market           74     480
## 23356: 2014-12-31 23:56:00 Steuart at Market           74     277
##        subscription_type zip_code
##     1:        Subscriber    94112
##     2:        Subscriber    94102
##     3:          Customer    94070
##     4:          Customer    55417
##     5:          Customer    55417
##    ---                           
## 23352:        Subscriber    94102
## 23353:        Subscriber    94131
## 23354:        Subscriber    94107
## 23355:          Customer    94109
## 23356:          Customer    94109
# Filter all rows where trip_id is 588841, 139560, or 139562
filter_trip_ids <- batrips[trip_id %in% c(588841, 139560, 139562)]
filter_trip_ids
##    trip_id duration          start_date          start_station
## 1:  139560     3793 2014-01-01 00:32:00      Steuart at Market
## 2:  139562     3626 2014-01-01 00:33:00      Steuart at Market
## 3:  588841      308 2014-12-31 17:27:00 Embarcadero at Sansome
##    start_terminal            end_date       end_station end_terminal
## 1:             74 2014-01-01 01:35:00 Steuart at Market           74
## 2:             74 2014-01-01 01:33:00 Steuart at Market           74
## 3:             60 2014-12-31 17:32:00 Steuart at Market           74
##    bike_id subscription_type zip_code
## 1:     311          Customer    55417
## 2:     271          Customer    94070
## 3:     395        Subscriber    94102
# Filter all rows where duration is between [5000, 6000]
duration_5k_6k <- batrips[duration %between% c(5000, 6000)]
duration_5k_6k
##       trip_id duration          start_date                   start_station
##    1:  139607     5987 2014-01-01 07:57:00               Market at Sansome
##    2:  139608     5974 2014-01-01 07:57:00               Market at Sansome
##    3:  139663     5114 2014-01-01 11:29:00           Embarcadero at Bryant
##    4:  139664     5040 2014-01-01 11:30:00           Embarcadero at Bryant
##    5:  139887     5023 2014-01-01 16:08:00                Davis at Jackson
##   ---                                                                     
## 1042:  586297     5425 2014-12-27 14:02:00        South Van Ness at Market
## 1043:  586673     5849 2014-12-28 14:50:00          University and Emerson
## 1044:  586707     5162 2014-12-28 15:50:00     San Antonio Shopping Center
## 1045:  586795     5886 2014-12-28 20:05:00   Powell at Post (Union Square)
## 1046:  588370     5850 2014-12-31 02:46:00 Grant Avenue at Columbus Avenue
##       start_terminal            end_date                     end_station
##    1:             77 2014-01-01 09:37:00 Grant Avenue at Columbus Avenue
##    2:             77 2014-01-01 09:37:00 Grant Avenue at Columbus Avenue
##    3:             54 2014-01-01 12:55:00           Embarcadero at Bryant
##    4:             54 2014-01-01 12:54:00           Embarcadero at Bryant
##    5:             42 2014-01-01 17:31:00 Grant Avenue at Columbus Avenue
##   ---                                                                   
## 1042:             66 2014-12-27 15:32:00        South Van Ness at Market
## 1043:             35 2014-12-28 16:27:00          University and Emerson
## 1044:             31 2014-12-28 17:16:00    San Antonio Caltrain Station
## 1045:             71 2014-12-28 21:43:00   Powell at Post (Union Square)
## 1046:             73 2014-12-31 04:23:00   Powell at Post (Union Square)
##       end_terminal bike_id subscription_type zip_code
##    1:           73     591          Customer    75201
##    2:           73     596          Customer    75201
##    3:           54     604          Customer    94513
##    4:           54     281          Customer    94513
##    5:           73     329          Customer    89117
##   ---                                                
## 1042:           66     314          Customer    94306
## 1043:           35     254          Customer        1
## 1044:           29     680          Customer    95616
## 1045:           71     394          Customer    60616
## 1046:           71     401          Customer     <NA>
# Filter all rows with specific start stations
two_stations <- batrips[start_station %chin% c("San Francisco City Hall", "Embarcadero at Sansome")]
two_stations
##        trip_id duration          start_date           start_station
##     1:  139545      435 2014-01-01 00:14:00 San Francisco City Hall
##     2:  139546      432 2014-01-01 00:14:00 San Francisco City Hall
##     3:  139547     1523 2014-01-01 00:17:00  Embarcadero at Sansome
##     4:  139583      903 2014-01-01 01:34:00  Embarcadero at Sansome
##     5:  139584      873 2014-01-01 01:35:00  Embarcadero at Sansome
##    ---                                                             
## 15020:  588881     1800 2014-12-31 19:12:00  Embarcadero at Sansome
## 15021:  588882     1714 2014-12-31 19:14:00  Embarcadero at Sansome
## 15022:  588887      274 2014-12-31 20:02:00  Embarcadero at Sansome
## 15023:  588898      979 2014-12-31 22:06:00 San Francisco City Hall
## 15024:  588901      531 2014-12-31 22:08:00  Embarcadero at Sansome
##        start_terminal            end_date
##     1:             58 2014-01-01 00:21:00
##     2:             58 2014-01-01 00:21:00
##     3:             60 2014-01-01 00:42:00
##     4:             60 2014-01-01 01:49:00
##     5:             60 2014-01-01 01:49:00
##    ---                                   
## 15020:             60 2014-12-31 19:42:00
## 15021:             60 2014-12-31 19:43:00
## 15022:             60 2014-12-31 20:07:00
## 15023:             58 2014-12-31 22:22:00
## 15024:             60 2014-12-31 22:17:00
##                                     end_station end_terminal bike_id
##     1:                          Townsend at 7th           65     473
##     2:                          Townsend at 7th           65     395
##     3:                          Beale at Market           56     331
##     4: San Francisco Caltrain (Townsend at 4th)           70     278
##     5: San Francisco Caltrain (Townsend at 4th)           70     336
##    ---                                                              
## 15020:                            Market at 4th           76     313
## 15021:                            Market at 4th           76     328
## 15022:     Harry Bridges Plaza (Ferry Building)           50     358
## 15023:                          Clay at Battery           41     385
## 15024:                        Steuart at Market           74     360
##        subscription_type zip_code
##     1:        Subscriber    94612
##     2:        Subscriber    94107
##     3:        Subscriber    94112
##     4:          Customer     9406
##     5:          Customer     9406
##    ---                           
## 15020:          Customer     <NA>
## 15021:          Customer     <NA>
## 15022:        Subscriber    94111
## 15023:        Subscriber    94111
## 15024:        Subscriber    94107

Chapter 2 - Selecting and Computing on Columns

Selecting columns from a data.table:

  • The data.table syntax consists of DT[i, j, by] # which rows, what to do (operates on columns), grouped by what
  • The j argument can be passed a vector of column names or column numbers
    • ans <- batrips[, "trip_id"] # will still be a data.table (no need for drop=FALSE)
    • ans <- batrips[, c(2, 4)] # not recommended, since column numbers may change over time
    • ans <- batrips[, -c("start_date", "end_date", "end_station")] # excludes these columns
  • The data.table way allows for computations on columns as part of the j argument
    • ans <- batrips[, list(trip_id, dur = duration)] # duration will be renamed to dur while selecting
    • ans <- batrips[, list(trip_id)] # will return a data.table since it is wrapped in list()
  • The .() is an alias to list()
    • ans <- batrips[, .(trip_id, duration)] # select columns trip_id, duration
    • ans <- batrips[, list(trip_id, duration)] # same as above
  • To return a vector (analogous to drop=TRUE for a data.frame), pass the column name as an unquoted symbol, not wrapped in list() or .()
    • batrips[, duration] # will return a vector
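
The selection rules above can be sketched on a toy data.table (hypothetical data, not the batrips dataset):

```r
library(data.table)

# Toy table for illustrating j selection
dt <- data.table(trip_id = 1:3, duration = c(300, 600, 900))

dt[, .(trip_id)]      # one-column data.table; .() is an alias for list()
dt[, list(trip_id)]   # identical result
dt[, trip_id]         # unquoted symbol returns a plain vector
```
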

Computing on columns the data.table way:

  • Can perform computations directly on columns in j
    • ans <- batrips[, mean(duration)] # will return a single value, the mean of duration (will be a vector)
  • Can run computations for only a subset of rows by combining i and j
    • batrips[start_station == "Japantown", mean(duration)] # row subsetting happens PRIOR to mean() calculation
  • Can use .N as part of the j argument
    • batrips[start_station == "Japantown", .N] # number of trips starting from Japantown
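
A minimal sketch of combining i and j, using toy data rather than batrips:

```r
library(data.table)

# Toy table: subset rows in i, then compute in j on that subset only
dt <- data.table(station = c("A", "A", "B"), duration = c(100, 200, 400))

dt[station == "A", mean(duration)]  # mean over the two "A" rows only: 150
dt[station == "A", .N]              # row count of the subset via .N: 2
```
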

Advanced computations in j:

  • Can compute in j and return a data.table
    • ans <- batrips[, .(trip_id, dur = duration)]
    • batrips[, .(mn_dur = mean(duration), med_dur = median(duration))]
  • Can combine with i so that the computations in j are applied only to a subset of rows
    • batrips[start_station == "Japantown", .(mn_dur = mean(duration), med_dur = median(duration))]
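
As a small sketch of returning several named summaries from j at once (toy data, not batrips):

```r
library(data.table)

# Toy table: multiple named computations in j return a one-row data.table
dt <- data.table(station = c("A", "A", "B"), duration = c(100, 300, 400))

dt[station == "A", .(mn_dur = mean(duration), med_dur = median(duration))]
```
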

Example code includes:

# Select bike_id and trip_id using a character vector
df_way <- batrips[, c("bike_id", "trip_id")]
df_way
##         bike_id trip_id
##      1:     473  139545
##      2:     395  139546
##      3:     331  139547
##      4:     605  139549
##      5:     453  139550
##     ---                
## 326335:     573  588910
## 326336:     604  588911
## 326337:     480  588912
## 326338:     277  588913
## 326339:      56  588914
# Select start_station and end_station cols without a character vector
dt_way <- batrips[, .(start_station, end_station)]
dt_way
##                           start_station
##      1:         San Francisco City Hall
##      2:         San Francisco City Hall
##      3:          Embarcadero at Sansome
##      4:               Steuart at Market
##      5:               Steuart at Market
##     ---                                
## 326335:              Powell Street BART
## 326336: Grant Avenue at Columbus Avenue
## 326337:        South Van Ness at Market
## 326338:        South Van Ness at Market
## 326339:           Embarcadero at Bryant
##                                           end_station
##      1:                               Townsend at 7th
##      2:                               Townsend at 7th
##      3:                               Beale at Market
##      4:                            Powell Street BART
##      5:                            Powell Street BART
##     ---                                              
## 326335:      San Francisco Caltrain (Townsend at 4th)
## 326336: Yerba Buena Center of the Arts (3rd @ Howard)
## 326337:                             Steuart at Market
## 326338:                             Steuart at Market
## 326339:                                 Howard at 2nd
# You can also drop or deselect columns by prepending the character vector of column names with the - or ! operators
# e.g., dt[, -c("col1", "col2")] or dt[, !c("col1", "col2")] would both return all columns except col1 and col2

# Deselect start_terminal and end_terminal columns
drop_terminal_cols <- batrips[, -c("start_terminal", "end_terminal")]
drop_terminal_cols
##         trip_id duration          start_date
##      1:  139545      435 2014-01-01 00:14:00
##      2:  139546      432 2014-01-01 00:14:00
##      3:  139547     1523 2014-01-01 00:17:00
##      4:  139549     1620 2014-01-01 00:23:00
##      5:  139550     1617 2014-01-01 00:23:00
##     ---                                     
## 326335:  588910      437 2014-12-31 23:18:00
## 326336:  588911      422 2014-12-31 23:19:00
## 326337:  588912     1487 2014-12-31 23:31:00
## 326338:  588913     1458 2014-12-31 23:32:00
## 326339:  588914      364 2014-12-31 23:33:00
##                           start_station            end_date
##      1:         San Francisco City Hall 2014-01-01 00:21:00
##      2:         San Francisco City Hall 2014-01-01 00:21:00
##      3:          Embarcadero at Sansome 2014-01-01 00:42:00
##      4:               Steuart at Market 2014-01-01 00:50:00
##      5:               Steuart at Market 2014-01-01 00:50:00
##     ---                                                    
## 326335:              Powell Street BART 2014-12-31 23:25:00
## 326336: Grant Avenue at Columbus Avenue 2014-12-31 23:26:00
## 326337:        South Van Ness at Market 2014-12-31 23:56:00
## 326338:        South Van Ness at Market 2014-12-31 23:56:00
## 326339:           Embarcadero at Bryant 2014-12-31 23:40:00
##                                           end_station bike_id
##      1:                               Townsend at 7th     473
##      2:                               Townsend at 7th     395
##      3:                               Beale at Market     331
##      4:                            Powell Street BART     605
##      5:                            Powell Street BART     453
##     ---                                                      
## 326335:      San Francisco Caltrain (Townsend at 4th)     573
## 326336: Yerba Buena Center of the Arts (3rd @ Howard)     604
## 326337:                             Steuart at Market     480
## 326338:                             Steuart at Market     277
## 326339:                                 Howard at 2nd      56
##         subscription_type zip_code
##      1:        Subscriber    94612
##      2:        Subscriber    94107
##      3:        Subscriber    94112
##      4:          Customer    92007
##      5:          Customer    92007
##     ---                           
## 326335:        Subscriber    95050
## 326336:        Subscriber    94133
## 326337:          Customer    94109
## 326338:          Customer    94109
## 326339:        Subscriber    94105
# Calculate the mean duration using the j argument
mean_dur <- batrips[, mean(duration)]
mean_dur
## [1] 1132
# Get median duration after filtering
median_duration_filter <- batrips[end_station == "Market at 10th" & subscription_type == "Subscriber", median(duration)]
median_duration_filter
## [1] 651
# Compute duration of all trips
trip_duration <- batrips[, difftime(end_date, start_date, units = "min")]
head(trip_duration)
## Time differences in mins
## [1]  7  7 25 27 27 13
# Return the mean duration in a column named mean_durn
mean_duration <- batrips[, .(mean_durn = mean(duration))]
mean_duration
##    mean_durn
## 1:      1132
# Get the min and max duration values
min_max_duration <- batrips[, .(min(duration), max(duration))]
min_max_duration
##    V1       V2
## 1: 60 17270400
# Compute the mean duration and the timestamp of the last ride
other_stats <- batrips[, .(mean_duration = mean(duration), last_ride = max(end_date))]
other_stats
##    mean_duration           last_ride
## 1:          1132 2015-06-24 20:18:00
duration_stats <- batrips[start_station == "Townsend at 7th" & duration < 500, 
                          .(min_dur = min(duration), max_dur = max(duration))]
duration_stats
##    min_dur max_dur
## 1:      62     499
# Plot the histogram of duration based on conditions
batrips[start_station == "Townsend at 7th" & duration < 500, hist(duration)]

## $breaks
##  [1]  50 100 150 200 250 300 350 400 450 500
## 
## $counts
## [1]   28   15  792 2042  920  314  314  497  538
## 
## $density
## [1] 1.03e-04 5.49e-05 2.90e-03 7.48e-03 3.37e-03 1.15e-03 1.15e-03 1.82e-03
## [9] 1.97e-03
## 
## $mids
## [1]  75 125 175 225 275 325 375 425 475
## 
## $xname
## [1] "duration"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Chapter 3 - Groupwise Operations

Computations by Groups:

  • The by argument allows computations to be performed separately for each group
    • ans <- batrips[, .N, by = "start_station"]
    • ans <- batrips[, .N, by = .(start_station)] # can use .() or list() instead of a character vector
  • Can rename columns on the fly, both for computed columns and for grouping columns
    • ans <- batrips[, .(no_trips = .N), by = .(start = start_station)]
  • Can compute grouping variables on the fly as part of the by expression
    • ans <- batrips[, .N, by = .(start_station, mon = month(start_date))]
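
A minimal sketch of grouped counting on a toy table (the dt object and its columns are illustrative stand-ins, not the course's batrips data):

```r
library(data.table)

# Hypothetical mini version of batrips
dt <- data.table(
  start_station = c("A", "A", "B", "B", "B"),
  start_date    = as.IDate(c("2014-01-05", "2014-02-01", "2014-01-10",
                             "2014-02-14", "2014-02-20")),
  duration      = c(100, 200, 300, 400, 500)
)

# Count trips per group, renaming both the count and the grouping column
dt[, .(no_trips = .N), by = .(start = start_station)]

# Compute the grouping variable (month) on the fly inside by
dt[, .N, by = .(start_station, mon = month(start_date))]
```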

Chaining data.table expressions:

  • Chaining expressions is the process of running multiple operations in sequence to get a single output
    • batrips[duration > 360][order(duration)][1:3]
    • batrips[, .(mn_dur = mean(duration)), by = "start_station"][order(mn_dur)][1:3]
  • uniqueN() is a helper function that returns the number of unique values
    • ans <- batrips[, uniqueN(bike_id), by = month(start_date)]
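
Chaining can be sketched on a toy table (dt and its values are hypothetical):

```r
library(data.table)

dt <- data.table(
  start_station = c("A", "A", "B", "C"),
  duration      = c(900, 100, 400, 700)
)

# Mean duration per station, sorted ascending, top two -- all in one chain
dt[, .(mn_dur = mean(duration)), by = start_station][order(mn_dur)][1:2]
# B (mean 400) sorts before A (mean 500); C (700) is cut by the [1:2]
```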

Computations in j using .SD:

  • .SD stands for "Subset of Data" and simplifies per-group calculations
    • x[, print(.SD), by = id] # each group is itself a data.table, which is the power of the .SD helper
    • x[, .SD[1], by = id] # returns the first row of each group
  • Can use .SDcols to restrict .SD to a subset of the columns
    • batrips[, .SD[1], by = start_station]
    • batrips[, .SD[1], by = start_station, .SDcols = c("trip_id", "duration")]
    • batrips[, .SD[1], by = start_station, .SDcols = -c("trip_id", "duration")]
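
.SD and .SDcols can be sketched on a toy table (hypothetical data):

```r
library(data.table)

dt <- data.table(
  id = c(1L, 1L, 2L, 2L),
  x  = c(10, 20, 30, 40),
  y  = c("a", "b", "c", "d")
)

# .SD is the per-group subset of all non-by columns; [1] takes each group's first row
dt[, .SD[1], by = id]

# Restrict .SD to selected columns with .SDcols
dt[, .SD[1], by = id, .SDcols = "x"]
```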

Example code includes:

# Compute the mean duration for every start_station
mean_start_stn <- batrips[, .(mean_duration = mean(duration)), by = "start_station"]
mean_start_stn
##                                     start_station mean_duration
##  1:                       San Francisco City Hall          1894
##  2:                        Embarcadero at Sansome          1418
##  3:                             Steuart at Market           957
##  4:                                 5th at Howard           845
##  5:          Harry Bridges Plaza (Ferry Building)          1516
##  6:                               Beale at Market           857
##  7:                         Embarcadero at Folsom           854
##  8:                             2nd at South Park           698
##  9:                        Santa Clara at Almaden           954
## 10:                            Powell Street BART          1332
## 11:                                 Howard at 2nd           740
## 12:                               2nd at Townsend           841
## 13:                      South Van Ness at Market          3801
## 14:                                 2nd at Folsom           551
## 15:                                 Market at 4th          1272
## 16:                                Market at 10th          1073
## 17:                             Market at Sansome           829
## 18:                         Embarcadero at Bryant           993
## 19: Temporary Transbay Terminal (Howard at Beale)           656
## 20:             Civic Center BART (7th at Market)          1288
## 21:       San Francisco Caltrain 2 (330 Townsend)           702
## 22:               Grant Avenue at Columbus Avenue          1245
## 23:                          Paseo de San Antonio          1813
## 24:                         San Jose Civic Center          3554
## 25:                        University and Emerson          5435
## 26:                               Townsend at 7th           701
## 27:                        Embarcadero at Vallejo          1646
## 28:                         Washington at Kearney          2335
## 29:                               Spear at Folsom           740
## 30:      San Francisco Caltrain (Townsend at 4th)           819
## 31:                              Davis at Jackson           936
## 32:                               Clay at Battery          1200
## 33:                           Golden Gate at Polk          1196
## 34: Yerba Buena Center of the Arts (3rd @ Howard)           871
## 35:                 Powell at Post (Union Square)          1608
## 36:                  San Antonio Caltrain Station          2820
## 37:         Rengstorff Avenue / California Street          4363
## 38:                          Cowper at University          2190
## 39:           Mechanics Plaza (Market at Battery)          1007
## 40:                Mountain View Caltrain Station          1268
## 41:                              Adobe on Almaden           845
## 42:                      Commercial at Montgomery           796
## 43:                    SJSU - San Salvador at 9th           937
## 44:                               Post at Kearney           937
## 45:               California Ave Caltrain Station          4294
## 46:                                 St James Park           938
## 47:                       Mountain View City Hall          1558
## 48:                           San Salvador at 1st          1101
## 49:                          Evelyn Park and Ride          1212
## 50:             San Jose Diridon Caltrain Station           857
## 51:                 Redwood City Caltrain Station          4221
## 52:                    Palo Alto Caltrain Station          3220
## 53:                            San Jose City Hall          1001
## 54:                        SJSU 4th at San Carlos          2096
## 55:                                 Park at Olive          3705
## 56:                      Arena Green / SAP Center          2050
## 57:                              San Pedro Square           969
## 58:                                   MLK Library          1885
## 59:                                     Japantown          2464
## 60:                              Broadway at Main          3473
## 61:                    San Jose Government Center          1068
## 62:              Castro Street and El Camino Real          1831
## 63:                       San Mateo County Center          4046
## 64:                   San Antonio Shopping Center          1373
## 65:                             Franklin at Maple           803
## 66:                   Redwood City Medical Center          1912
## 67:                   Redwood City Public Library          3474
## 68:                     Broadway St at Battery St           883
## 69:                                    Mezes Park           771
## 70:                          Washington at Kearny          1886
## 71:                                Post at Kearny          1035
## 72:               Santa Clara County Civic Center          1379
## 73:                                   Ryland Park          1407
## 74:                      Stanford in Redwood City          1878
##                                     start_station mean_duration
# Compute the mean duration for every start and end station
mean_station <- batrips[, .(mean_duration = mean(duration)), by = .(start_station, end_station)]
mean_station
##                     start_station
##    1:     San Francisco City Hall
##    2:      Embarcadero at Sansome
##    3:           Steuart at Market
##    4:           Steuart at Market
##    5:               5th at Howard
##   ---                            
## 1859:                  Mezes Park
## 1860: Redwood City Public Library
## 1861:      University and Emerson
## 1862:           Franklin at Maple
## 1863:      University and Emerson
##                                         end_station mean_duration
##    1:                               Townsend at 7th           679
##    2:                               Beale at Market           651
##    3:                            Powell Street BART           884
##    4:                         Washington at Kearney          1553
##    5: Yerba Buena Center of the Arts (3rd @ Howard)          1811
##   ---                                                            
## 1859:                    Palo Alto Caltrain Station          2518
## 1860:                      Stanford in Redwood City           639
## 1861:                   Redwood City Public Library          1395
## 1862:                          Cowper at University          2654
## 1863:                       San Mateo County Center          1918
# Compute the mean duration grouped by start_station and month
mean_start_station <- batrips[, .(mean_duration = mean(duration)), by = .(start_station, month(start_date))]
mean_start_station
##                             start_station month mean_duration
##   1:              San Francisco City Hall     1          1548
##   2:               Embarcadero at Sansome     1           952
##   3:                    Steuart at Market     1           757
##   4:                        5th at Howard     1           599
##   5: Harry Bridges Plaza (Ferry Building)     1          1429
##  ---                                                         
## 830:      California Ave Caltrain Station    12          4230
## 831:               University and Emerson    12          7771
## 832:           SJSU - San Salvador at 9th    12           653
## 833:              San Mateo County Center    12          5034
## 834:          Redwood City Public Library    12           496
# Compute mean of duration and total trips grouped by start and end stations
aggregate_mean_trips <- batrips[, .(mean_duration = mean(duration), total_trips = .N), by = .(start_station, end_station)]
aggregate_mean_trips
##                     start_station
##    1:     San Francisco City Hall
##    2:      Embarcadero at Sansome
##    3:           Steuart at Market
##    4:           Steuart at Market
##    5:               5th at Howard
##   ---                            
## 1859:                  Mezes Park
## 1860: Redwood City Public Library
## 1861:      University and Emerson
## 1862:           Franklin at Maple
## 1863:      University and Emerson
##                                         end_station mean_duration
##    1:                               Townsend at 7th           679
##    2:                               Beale at Market           651
##    3:                            Powell Street BART           884
##    4:                         Washington at Kearney          1553
##    5: Yerba Buena Center of the Arts (3rd @ Howard)          1811
##   ---                                                            
## 1859:                    Palo Alto Caltrain Station          2518
## 1860:                      Stanford in Redwood City           639
## 1861:                   Redwood City Public Library          1395
## 1862:                          Cowper at University          2654
## 1863:                       San Mateo County Center          1918
##       total_trips
##    1:         121
##    2:         545
##    3:         145
##    4:           9
##    5:          59
##   ---            
## 1859:           2
## 1860:           1
## 1861:           1
## 1862:           1
## 1863:           2
# Compute min and max duration grouped by start station, end station, and month
aggregate_min_max <- batrips[, .(min_duration = min(duration), max_duration = max(duration)), by = .(start_station, end_station, month(start_date))]
aggregate_min_max
##                            start_station
##     1:           San Francisco City Hall
##     2:            Embarcadero at Sansome
##     3:                 Steuart at Market
##     4:                 Steuart at Market
##     5:                     5th at Howard
##    ---                                  
## 17665: Civic Center BART (7th at Market)
## 17666:         Broadway St at Battery St
## 17667:                         Japantown
## 17668:          Arena Green / SAP Center
## 17669:              Washington at Kearny
##                                          end_station month min_duration
##     1:                               Townsend at 7th     1          370
##     2:                               Beale at Market     1          345
##     3:                            Powell Street BART     1          498
##     4:                         Washington at Kearney     1          312
##     5: Yerba Buena Center of the Arts (3rd @ Howard)     1          349
##    ---                                                                 
## 17665:                        Embarcadero at Vallejo    12          870
## 17666:                          Washington at Kearny    12          158
## 17667:                                     Japantown    12         3237
## 17668:                                   Ryland Park    12          767
## 17669:                                 Howard at 2nd    12          915
##        max_duration
##     1:          661
##     2:         1674
##     3:         1620
##     4:          784
##     5:          624
##    ---             
## 17665:          870
## 17666:          158
## 17667:         3249
## 17668:          767
## 17669:          965
# Arrange the total trips grouped by start_station and end_station in decreasing order
trips_dec <- batrips[, .N, by = .(start_station, end_station)][order(-N)]
trips_dec
##                                 start_station
##    1:                         Townsend at 7th
##    2: San Francisco Caltrain 2 (330 Townsend)
##    3:    Harry Bridges Plaza (Ferry Building)
##    4:                         2nd at Townsend
##    5:                          Market at 10th
##   ---                                        
## 1859:           Redwood City Caltrain Station
## 1860:                              Mezes Park
## 1861:             Redwood City Public Library
## 1862:                  University and Emerson
## 1863:                       Franklin at Maple
##                                    end_station    N
##    1: San Francisco Caltrain (Townsend at 4th) 3158
##    2:                          Townsend at 7th 2937
##    3:                   Embarcadero at Sansome 2826
##    4:     Harry Bridges Plaza (Ferry Building) 2330
##    5: San Francisco Caltrain (Townsend at 4th) 2131
##   ---                                              
## 1859:          California Ave Caltrain Station    1
## 1860:          California Ave Caltrain Station    1
## 1861:                 Stanford in Redwood City    1
## 1862:              Redwood City Public Library    1
## 1863:                     Cowper at University    1
# Top five most popular destinations
top_5 <- batrips[, .N, by = .(end_station)][order(-N)][1:5]
top_5
##                                 end_station     N
## 1: San Francisco Caltrain (Townsend at 4th) 33213
## 2:     Harry Bridges Plaza (Ferry Building) 15692
## 3:  San Francisco Caltrain 2 (330 Townsend) 15333
## 4:                        Market at Sansome 14816
## 5:                          2nd at Townsend 14064
# Compute most popular end station for every start station
popular_end_station <- trips_dec[, .(end_station = head(end_station, 1)), by = .(start_station)]
popular_end_station
##                                     start_station
##  1:                               Townsend at 7th
##  2:       San Francisco Caltrain 2 (330 Townsend)
##  3:          Harry Bridges Plaza (Ferry Building)
##  4:                               2nd at Townsend
##  5:                                Market at 10th
##  6:                             2nd at South Park
##  7:                        Embarcadero at Sansome
##  8:                             Market at Sansome
##  9:                             Steuart at Market
## 10:                         Embarcadero at Folsom
## 11:      San Francisco Caltrain (Townsend at 4th)
## 12: Temporary Transbay Terminal (Howard at Beale)
## 13:                                 5th at Howard
## 14:                                 Market at 4th
## 15:                                 2nd at Folsom
## 16:                Mountain View Caltrain Station
## 17:                            Powell Street BART
## 18:               Grant Avenue at Columbus Avenue
## 19:                                 Howard at 2nd
## 20: Yerba Buena Center of the Arts (3rd @ Howard)
## 21:                       Mountain View City Hall
## 22:             Civic Center BART (7th at Market)
## 23:                        Santa Clara at Almaden
## 24:             San Jose Diridon Caltrain Station
## 25:              Castro Street and El Camino Real
## 26:                               Beale at Market
## 27:                         Embarcadero at Bryant
## 28:                      South Van Ness at Market
## 29:                   San Antonio Shopping Center
## 30:                  San Antonio Caltrain Station
## 31:                        Embarcadero at Vallejo
## 32:                          Evelyn Park and Ride
## 33:                     Broadway St at Battery St
## 34:                               Clay at Battery
## 35:                               Spear at Folsom
## 36:                              Davis at Jackson
## 37:                      Commercial at Montgomery
## 38:                              San Pedro Square
## 39:           Mechanics Plaza (Market at Battery)
## 40:                 Powell at Post (Union Square)
## 41:                            San Jose City Hall
## 42:                        University and Emerson
## 43:                           Golden Gate at Polk
## 44:                          Cowper at University
## 45:                          Paseo de San Antonio
## 46:                              Adobe on Almaden
## 47:                                Post at Kearny
## 48:                                 St James Park
## 49:                      Arena Green / SAP Center
## 50:                    Palo Alto Caltrain Station
## 51:                           San Salvador at 1st
## 52:                                   Ryland Park
## 53:                    SJSU - San Salvador at 9th
## 54:               Santa Clara County Civic Center
## 55:                       San Francisco City Hall
## 56:                          Washington at Kearny
## 57:                                   MLK Library
## 58:                         San Jose Civic Center
## 59:         Rengstorff Avenue / California Street
## 60:                                     Japantown
## 61:               California Ave Caltrain Station
## 62:                 Redwood City Caltrain Station
## 63:                                    Mezes Park
## 64:                        SJSU 4th at San Carlos
## 65:                                 Park at Olive
## 66:                               Post at Kearney
## 67:                             Franklin at Maple
## 68:                   Redwood City Public Library
## 69:                   Redwood City Medical Center
## 70:                       San Mateo County Center
## 71:                      Stanford in Redwood City
## 72:                         Washington at Kearney
## 73:                              Broadway at Main
## 74:                    San Jose Government Center
##                                     start_station
##                                       end_station
##  1:      San Francisco Caltrain (Townsend at 4th)
##  2:                               Townsend at 7th
##  3:                        Embarcadero at Sansome
##  4:          Harry Bridges Plaza (Ferry Building)
##  5:      San Francisco Caltrain (Townsend at 4th)
##  6:                             Market at Sansome
##  7:                             Steuart at Market
##  8:                             2nd at South Park
##  9:      San Francisco Caltrain (Townsend at 4th)
## 10:      San Francisco Caltrain (Townsend at 4th)
## 11: Temporary Transbay Terminal (Howard at Beale)
## 12:      San Francisco Caltrain (Townsend at 4th)
## 13:      San Francisco Caltrain (Townsend at 4th)
## 14:      San Francisco Caltrain (Townsend at 4th)
## 15:                             Market at Sansome
## 16:                       Mountain View City Hall
## 17:      San Francisco Caltrain (Townsend at 4th)
## 18:                             Market at Sansome
## 19:      San Francisco Caltrain (Townsend at 4th)
## 20:      San Francisco Caltrain (Townsend at 4th)
## 21:                Mountain View Caltrain Station
## 22:                               Townsend at 7th
## 23:             San Jose Diridon Caltrain Station
## 24:                        Santa Clara at Almaden
## 25:                Mountain View Caltrain Station
## 26:      San Francisco Caltrain (Townsend at 4th)
## 27:      San Francisco Caltrain (Townsend at 4th)
## 28:      San Francisco Caltrain (Townsend at 4th)
## 29:                  San Antonio Caltrain Station
## 30:                   San Antonio Shopping Center
## 31:                             Steuart at Market
## 32:                Mountain View Caltrain Station
## 33:      San Francisco Caltrain (Townsend at 4th)
## 34:      San Francisco Caltrain (Townsend at 4th)
## 35:      San Francisco Caltrain (Townsend at 4th)
## 36:      San Francisco Caltrain (Townsend at 4th)
## 37:      San Francisco Caltrain (Townsend at 4th)
## 38:             San Jose Diridon Caltrain Station
## 39:                                 Market at 4th
## 40:      San Francisco Caltrain (Townsend at 4th)
## 41:             San Jose Diridon Caltrain Station
## 42:                        University and Emerson
## 43:      San Francisco Caltrain (Townsend at 4th)
## 44:                    Palo Alto Caltrain Station
## 45:             San Jose Diridon Caltrain Station
## 46:             San Jose Diridon Caltrain Station
## 47:      San Francisco Caltrain (Townsend at 4th)
## 48:             San Jose Diridon Caltrain Station
## 49:                        Santa Clara at Almaden
## 50:                          Cowper at University
## 51:                                   MLK Library
## 52:               Santa Clara County Civic Center
## 53:             San Jose Diridon Caltrain Station
## 54:                                   Ryland Park
## 55:                            Powell Street BART
## 56:                            Powell Street BART
## 57:             San Jose Diridon Caltrain Station
## 58:                         San Jose Civic Center
## 59:                Mountain View Caltrain Station
## 60:             San Jose Diridon Caltrain Station
## 61:                    Palo Alto Caltrain Station
## 62:                                    Mezes Park
## 63:                 Redwood City Caltrain Station
## 64:                        Santa Clara at Almaden
## 65:                    Palo Alto Caltrain Station
## 66:                         Washington at Kearney
## 67:                 Redwood City Caltrain Station
## 68:                 Redwood City Caltrain Station
## 69:                 Redwood City Caltrain Station
## 70:                       San Mateo County Center
## 71:                 Redwood City Caltrain Station
## 72:                            Powell Street BART
## 73:                              Broadway at Main
## 74:                                     Japantown
##                                       end_station
# Find the first and last ride for each start_station
first_last <- batrips[order(start_date), 
                      .(start_date = c(head(start_date, 1), tail(start_date, 1))), 
                      by = .(start_station)]
first_last
##                        start_station          start_date
##   1:         San Francisco City Hall 2014-01-01 00:14:00
##   2:         San Francisco City Hall 2014-12-31 22:06:00
##   3:          Embarcadero at Sansome 2014-01-01 00:17:00
##   4:          Embarcadero at Sansome 2014-12-31 22:08:00
##   5:               Steuart at Market 2014-01-01 00:23:00
##  ---                                                    
## 144: Santa Clara County Civic Center 2014-12-31 15:32:00
## 145:                     Ryland Park 2014-04-10 09:10:00
## 146:                     Ryland Park 2014-12-31 07:56:00
## 147:        Stanford in Redwood City 2014-09-03 19:41:00
## 148:        Stanford in Redwood City 2014-12-22 16:56:00
relevant_cols <- c("start_station", "end_station", "start_date", "end_date", "duration")

# Find the row corresponding to the shortest trip per month
shortest <- batrips[, .SD[which.min(duration)], by = month(start_date), .SDcols = relevant_cols]
shortest
##     month                                 start_station
##  1:     1                               2nd at Townsend
##  2:     2      San Francisco Caltrain (Townsend at 4th)
##  3:     3           Mechanics Plaza (Market at Battery)
##  4:     4                      South Van Ness at Market
##  5:     5                                Market at 10th
##  6:     6                            Powell Street BART
##  7:     7                            Powell Street BART
##  8:     8                     Broadway St at Battery St
##  9:     9             Civic Center BART (7th at Market)
## 10:    10 Yerba Buena Center of the Arts (3rd @ Howard)
## 11:    11 Temporary Transbay Terminal (Howard at Beale)
## 12:    12                             2nd at South Park
##                                       end_station          start_date
##  1:                               2nd at Townsend 2014-01-21 13:01:00
##  2:      San Francisco Caltrain (Townsend at 4th) 2014-02-08 14:28:00
##  3:           Mechanics Plaza (Market at Battery) 2014-03-18 17:50:00
##  4:                      South Van Ness at Market 2014-04-12 04:28:00
##  5:                                Market at 10th 2014-05-14 20:11:00
##  6:                            Powell Street BART 2014-06-23 17:31:00
##  7:                            Powell Street BART 2014-07-14 14:09:00
##  8:                     Broadway St at Battery St 2014-08-15 15:15:00
##  9:             Civic Center BART (7th at Market) 2014-09-04 10:53:00
## 10: Yerba Buena Center of the Arts (3rd @ Howard) 2014-10-04 19:21:00
## 11: Temporary Transbay Terminal (Howard at Beale) 2014-11-07 09:45:00
## 12:                             2nd at South Park 2014-12-08 11:38:00
##                end_date duration
##  1: 2014-01-21 13:02:00       60
##  2: 2014-02-08 14:29:00       61
##  3: 2014-03-18 17:51:00       60
##  4: 2014-04-12 04:29:00       61
##  5: 2014-05-14 20:12:00       60
##  6: 2014-06-23 17:32:00       60
##  7: 2014-07-14 14:10:00       60
##  8: 2014-08-15 15:16:00       60
##  9: 2014-09-04 10:54:00       60
## 10: 2014-10-04 19:22:00       60
## 11: 2014-11-07 09:46:00       60
## 12: 2014-12-08 11:39:00       60
# Find the total number of unique start stations and zip codes per month
unique_station_month <- batrips[, lapply(.SD, uniqueN), 
                                by = month(start_date), 
                                .SDcols = c("start_station", "zip_code")]
unique_station_month
##     month start_station zip_code
##  1:     1            68      710
##  2:     2            69      591
##  3:     3            69      894
##  4:     4            70      895
##  5:     5            70     1073
##  6:     6            70     1028
##  7:     7            70     1068
##  8:     8            70     1184
##  9:     9            70      971
## 10:    10            70      991
## 11:    11            70      769
## 12:    12            68      586

Chapter 4 - Reference Semantics

Adding and Updating Columns by Reference:

  • Can add, delete, and update columns in place
    • df <- data.frame(x = 1:5, y = 6:10)
    • df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
    • df[1:2] <- lapply(df[1:2], function(x) ifelse(x%%2, x, NA))
  • The data.table internals are such that neither the whole frame nor entire columns are deep copied simply to update a few values
    • data.table updates columns in place, i.e., by reference
    • This means, you don’t need the assign the result back to a variable
    • No copy of any column is made while their values are changed
    • data.table uses a new operator := to add/update/delete columns by reference
  • Two ways to use the := operator
    • batrips[, (“is_dur_gt_1hour”, “week_day”) := list(duration > 3600, wday(start_date)]
    • batrips[, is_dur_gt_1hour := duration > 3600] # can skip the parentheses and quotes if there is just a single variable for processing
    • batrips[, `:=`(is_dur_gt_1hour = NULL, start_station = toupper(start_station))] # note that `:=` is the function being called and that NULL means "delete the column"
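For reference, the := forms above can be exercised on a small toy table - a hypothetical stand-in for batrips (assumes data.table is loaded):

library(data.table)

# Toy stand-in for batrips; column names are illustrative only
dt <- data.table(duration = c(1200, 4000, 500),
                 start_station = c("mezes", "ryland", "mezes"))

# LHS := RHS form for multiple columns
dt[, c("is_long", "dur_min") := list(duration > 3600, duration / 60)]

# Single column: the quotes and c() can be skipped
dt[, is_short := duration < 600]

# Functional form; assigning NULL deletes a column by reference
dt[, `:=`(is_short = NULL, start_station = toupper(start_station))]
dt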

Grouped Aggregations:

  • Can run grouped aggregations by combining := and by
    • batrips[, n_zip_code := .N, by = zip_code] # nothing is printed to the console, though the number of columns increases by 1
    • zip_1000 <- batrips[n_zip_code > 1000][, n_zip_code := NULL] # can delete the intermediate columns by setting them to NULL

Advanced Aggregations:

  • Can add multiple columns by reference, each within a specified by group
    • batrips[, `:=`(end_dur_first = duration[1], end_dur_last = duration[.N]), by = end_station]
    • batrips[, c("end_dur_first", "end_dur_last") := list(duration[1], duration[.N]), by = end_station]
  • Can use ifelse statements, and the j argument can take multi-line expressions wrapped inside {}
    • batrips[, trip_category := { med_dur = median(duration, na.rm = TRUE); if (med_dur < 600) "short" else if (med_dur >= 600 & med_dur <= 1800) "medium" else "long" }, by = .(start_station, end_station)]
  • Can also apply user-defined functions to achieve the same tasks
    • bin_median_duration <- function(dur) {
    • med_dur <- median(dur, na.rm = TRUE)
    • if (med_dur < 600) "short"
    • else if (med_dur >= 600 & med_dur <= 1800) "medium"
    • else "long"
    • }
    • batrips[, trip_category := bin_median_duration(duration), by = .(start_station, end_station)]
  • Can combine i, j, and by in a single operation
    • batrips[duration > 500, min_dur_gt_500 := min(duration), by = .(start_station, end_station)]

Example code includes:

data(batrips, package="bikeshare14")
batrips <- as.data.table(batrips)

batrips_new <- copy(batrips) # use copy(); plain assignment would only alias the same data.table
makeNA <- sample(1:nrow(batrips), round(0.05*nrow(batrips)), replace=FALSE)
batrips_new[makeNA, "duration"] <- NA


# Add a new column, duration_hour
batrips[, duration_hour := duration/3600]


# Print untidy
# untidy[1:2]

# Fix spelling in the second row of start_station
# untidy[2, start_station:="San Francisco City Hall"]

# Replace negative duration values with NA
# untidy[duration < 0, duration:=NA]


# Add a new column equal to total trips for every start station
batrips[, trips_N:=.N, by = start_station]

# Add new column for every start_station and end_station
batrips[, duration_mean:=mean(duration), by = .(start_station, end_station)]


# Calculate the mean duration for each month
batrips_new[, mean_dur:=mean(duration, na.rm=TRUE), by = month(start_date)]

# Replace NA values in duration with the mean value of duration for that month
batrips_new[, mean_dur := mean(duration, na.rm = TRUE), by = month(start_date)][is.na(duration), duration:=round(mean_dur,0)]
## Warning in `[.data.table`(batrips_new[, `:=`(mean_dur, mean(duration, na.rm
## = TRUE)), : Coerced double RHS to integer to match the type of the target
## column (column 2 named 'duration'). The RHS values contain no fractions
## so would be more efficiently created as integer. Consider using R's 'L'
## postfix (typeof(0L) vs typeof(0)) to create constants as integer and avoid
## this warning. Wrapping the RHS with as.integer() will avoid this warning
## too but it's better if possible to create the RHS as integer in the first
## place so that the cost of the coercion can be avoided.
# Delete the mean_dur column by reference
batrips_new[, mean_dur := mean(duration, na.rm = TRUE), by = month(start_date)][is.na(duration), duration := mean_dur][, mean_dur:=NULL]


# Add columns using the LHS := RHS form
batrips[, c("mean_duration", "median_duration"):=.(mean(duration), as.integer(round(median(duration), 0))), by=start_station]

# Add columns using the functional form
batrips[, `:=`(mean_duration=mean(duration), median_duration=as.integer(round(median(duration), 0))), by = start_station]

# Add the mean_duration column
batrips[duration > 600, mean_duration:=mean(duration), by=.(start_station, end_station)]

Chapter 5 - Importing and Exporting Data

Fast data reading with fread():

  • The fread() function is a fast flat-file reader since it imports files in parallel (default is to use all available threads)
    • Can import local files, files from the web, and strings
    • Intelligent defaults - colClasses, sep, nrows etc.
    • Note: Dates and Datetimes are read as character columns but can be converted later with the excellent fasttime or anytime packages
    • DT1 <- fread("https://bit.ly/2RkBXhV")
    • DT2 <- fread("data.csv")
    • DT3 <- fread("a,b\n1,2\n3,4")
    • DT4 <- fread("1,2\n3,4")
  • The nrows and skip arguments can be helpful for finer control
    • fread("a,b\n1,2\n3,4", nrows = 1) # nrows=1 means read 1 row in addition to the header
    • str <- "# Metadata: 2018-05-01 19:44:28 GMT\na,b\n1,2\n3,4"
    • fread(str, skip = 2) # skips the first two lines entirely before attempting to parse the file
  • Can also pass a string argument to skip
    • str <- "# Metadata: 2018-05-01 19:44:28 GMT\na,b\n1,2\n3,4"
    • fread(str, skip = "a,b") # everything before the line "a,b" will be skipped
  • The select and drop arguments allow for control of columns to read
    • str <- "a,b,c\n1,2,x\n3,4,y"
    • fread(str, select = c("a", "c"))
    • fread(str, drop = "b") # same as above for this chunk of data
    • str <- "1,2,x\n3,4,y"
    • fread(str, select = c(1, 3))
    • fread(str, drop = 2) # same as above for this chunk of data

Advanced file reading:

  • By default, R can only represent integers up to 2^31 - 1
    • Large integers are automatically read in as integer64 type, provided by the bit64 package
  • Can override the colClasses() guessing that is otherwise automatic in fread()
    • str <- "x1,x2,x3,x4,x5\n1,2,1.5,true,cc\n3,4,2.5,false,ff"
    • ans <- fread(str, colClasses = c(x5 = "factor"))
    • ans <- fread(str, colClasses = c("integer", "integer", "numeric", "logical", "factor"))
    • str <- "x1,x2,x3,x4,x5,x6\n1,2,1.5,2.5,aa,bb\n3,4,5.5,6.5,cc,dd"
    • ans <- fread(str, colClasses = list(numeric = 1:4, factor = c("x5", "x6"))); str(ans) # specifies that columns 1:4 are numeric and that "x5" and "x6" will be factors
  • The fill argument can be used to direct fread() to fill missing values
    • str <- "1,2\n3,4,a\n5,6\n7,8,b"
    • fread(str) # throws a warning since fill=FALSE is the default
    • fread(str, fill = TRUE) # will assume empty strings for the missing fields
  • Can override the defaults for what to treat as NA
    • str <- "x,y,z\n1,###,3\n2,4,###\n#N/A,7,9"
    • ans <- fread(str, na.strings = c("###", "#N/A"))

Fast data writing with fwrite():

  • The fwrite() function is a fast parallel flat file writer
    • dt <- data.table(id = c("x", "y", "z"), val = list(1:2, 3:4, 5:6))
    • fwrite(dt, "fwrite.csv") # the list is flattened by using the secondary separator (default is |)
    • fread("fwrite.csv")
  • Dates and datetimes are saved in ISO format for further clarity
    • fwrite() provides three additional ways of writing date and datetime format - ISO, squash and epoch
    • Encourages the use of ISO standards with ISO as default
    • now <- Sys.time()
    • dt <- data.table(date = as.IDate(now), time = as.ITime(now), datetime = now)
    • fwrite(dt, "datetime.csv", dateTimeAs = "ISO")
    • fread("datetime.csv")
  • Additional date and time formats available - squash
    • squash writes yyyy-mm-dd hh:mm:ss as yyyymmddhhmmss, for example.
    • Read in as integer. Very useful to extract month, year etc by simply using modulo arithmetic. e.g., 20160912 %/% 10000 = 2016
    • Also handles milliseconds (ms) resolution.
    • POSIXct type (17 digits with ms resolution) is automatically read in as integer64 by fread
  • Additional date and time formats available - epoch
    • epoch counts the number of days (for dates) or seconds (for time and datetime) since relevant epoch
    • The relevant epochs are 1970-01-01 (date), 00:00:00 (time), and 1970-01-01T00:00:00Z (datetime)
    • fwrite(dt, "datetime.csv", dateTimeAs = "epoch")
    • fread("datetime.csv")
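As a sketch of the modulo arithmetic mentioned for the squash format, the year, month, and day can be peeled off a squashed yyyymmdd integer with base R alone (the value is the one from the bullet above):

squashed <- 20160912

squashed %/% 10000        # year: 2016
(squashed %/% 100) %% 100 # month: 9
squashed %% 100           # day: 12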

Example code includes:

data(batrips, package="bikeshare14")
batrips <- as.data.table(batrips)
readr::write_csv(batrips, "./RInputFiles/_batrips.csv")


# Use read.csv() to import batrips
system.time(read.csv("./RInputFiles/_batrips.csv"))
##    user  system elapsed 
##    3.46    0.13    3.67
# Use fread() to import batrips
system.time(fread("./RInputFiles/_batrips.csv"))
##    user  system elapsed 
##    0.34    0.04    0.38
cat('id,"name",val
29192,"Robert Whitaker", 200
49301 ,"Elisa Waters",190
', file="./RInputFiles/_sample.csv")


# Import using read.csv()
csv_file <- read.csv("./RInputFiles/_sample.csv", fill = NA, quote = "", stringsAsFactors = FALSE, strip.white = TRUE, header = TRUE)
csv_file
##      id           X.name. val
## 1 29192 "Robert Whitaker" 200
## 2 49301    "Elisa Waters" 190
# Import using fread()
csv_file <- fread("./RInputFiles/_sample.csv")
csv_file
##       id            name val
## 1: 29192 Robert Whitaker 200
## 2: 49301    Elisa Waters 190
cat("id,name,val
29192,Robert Whitaker, 200
49301 ,Elisa Waters,190  
34456 , Karla Schmidt,458
", file="./RInputFiles/_sample.csv")


# Select "id" and "val" columns
select_columns <- fread("./RInputFiles/_sample.csv", select=c("id", "val"))
select_columns
##       id val
## 1: 29192 200
## 2: 49301 190
## 3: 34456 458
# Drop the "val" column
drop_column <- fread("./RInputFiles/_sample.csv", drop=c("val"))
drop_column
##       id            name
## 1: 29192 Robert Whitaker
## 2: 49301    Elisa Waters
## 3: 34456   Karla Schmidt
cat('id,"name",val
29192,"Robert Whitaker", 200
49301 , Elisa Waters,190  
34456 , Karla Schmidt,458  

END-OF-DATA
METADATA
attr;value
date;"2018-01-01"
data;"cash payment" 
', file="./RInputFiles/_sample.csv")


# Import the file
entire_file <- fread("./RInputFiles/_sample.csv")
## Warning in fread("./RInputFiles/_sample.csv"): Stopped early on line 5.
## Expected 3 fields but found 0. Consider fill=TRUE and comment.char=. First
## discarded non-empty line: <<END-OF-DATA>>
entire_file
##       id            name val
## 1: 29192 Robert Whitaker 200
## 2: 49301    Elisa Waters 190
## 3: 34456   Karla Schmidt 458
# Import the file while avoiding the warning
only_data <- fread("./RInputFiles/_sample.csv", nrows=3)
only_data
##       id            name val
## 1: 29192 Robert Whitaker 200
## 2: 49301    Elisa Waters 190
## 3: 34456   Karla Schmidt 458
# Import only the metadata
only_metadata <- fread("./RInputFiles/_sample.csv", skip="attr;value")
only_metadata
##    attr        value
## 1: date   2018-01-01
## 2: data cash payment
cat('id,name,val
9002019291929192,Robert Whitaker, 200
9200129401349301 ,Elisa Waters,190  
9200149429834456 , Karla Schmidt,458
', file="./RInputFiles/_sample.csv")


# Import the file using fread 
fread_import <- fread("./RInputFiles/_sample.csv")

# Import the file using read.csv 
base_import <- read.csv("./RInputFiles/_sample.csv")

# Check the class of id column
class(fread_import$id)
## [1] "integer64"
class(base_import$id)
## [1] "numeric"
cat('c1,c2,c3,c3.1,c5,n1,n2,n3,n4,n5
aa,bb,cc,dd,ee,1,2,3,4,5
ff,gg,hh,ii,jj,6,7,8,9,10
', file="./RInputFiles/_sample.csv")


# Import using read.csv with defaults
base_r_defaults <- read.csv("./RInputFiles/_sample.csv")
str(base_r_defaults)
## 'data.frame':    2 obs. of  10 variables:
##  $ c1  : Factor w/ 2 levels "aa","ff": 1 2
##  $ c2  : Factor w/ 2 levels "bb","gg": 1 2
##  $ c3  : Factor w/ 2 levels "cc","hh": 1 2
##  $ c3.1: Factor w/ 2 levels "dd","ii": 1 2
##  $ c5  : Factor w/ 2 levels "ee","jj": 1 2
##  $ n1  : int  1 6
##  $ n2  : int  2 7
##  $ n3  : int  3 8
##  $ n4  : int  4 9
##  $ n5  : int  5 10
# Import using read.csv
base_r <- read.csv("./RInputFiles/_sample.csv", 
                   colClasses = c(rep("factor", 4), "character", "integer", rep("numeric", 4))
                   )
str(base_r)
## 'data.frame':    2 obs. of  10 variables:
##  $ c1  : Factor w/ 2 levels "aa","ff": 1 2
##  $ c2  : Factor w/ 2 levels "bb","gg": 1 2
##  $ c3  : Factor w/ 2 levels "cc","hh": 1 2
##  $ c3.1: Factor w/ 2 levels "dd","ii": 1 2
##  $ c5  : chr  "ee" "jj"
##  $ n1  : int  1 6
##  $ n2  : num  2 7
##  $ n3  : num  3 8
##  $ n4  : num  4 9
##  $ n5  : num  5 10
# Import using fread
import_fread <- fread("./RInputFiles/_sample.csv", colClasses = list(factor=1:4, numeric=7:10))
str(import_fread)
## Classes 'data.table' and 'data.frame':   2 obs. of  10 variables:
##  $ c1  : Factor w/ 2 levels "aa","ff": 1 2
##  $ c2  : Factor w/ 2 levels "bb","gg": 1 2
##  $ c3  : Factor w/ 2 levels "cc","hh": 1 2
##  $ c3.1: Factor w/ 2 levels "dd","ii": 1 2
##  $ c5  : chr  "ee" "jj"
##  $ n1  : int  1 6
##  $ n2  : num  2 7
##  $ n3  : num  3 8
##  $ n4  : num  4 9
##  $ n5  : num  5 10
##  - attr(*, ".internal.selfref")=<externalptr>
cat('id,name,val
9002019291929192,Robert Whitaker,
9200129401349301 ,Elisa Waters,190  
9200149429834456 , Karla Schmidt
', file="./RInputFiles/_sample.csv")


# Import the file and note the warning message
incorrect <- fread("./RInputFiles/_sample.csv")
## Warning in fread("./RInputFiles/_sample.csv"): Discarded single-line
## footer: <<9200149429834456 , Karla Schmidt>>
incorrect
##                  id            name val
## 1: 9002019291929192 Robert Whitaker  NA
## 2: 9200129401349301    Elisa Waters 190
# Import the file correctly
correct <- fread("./RInputFiles/_sample.csv", fill=TRUE)
correct
##                  id            name val
## 1: 9002019291929192 Robert Whitaker  NA
## 2: 9200129401349301    Elisa Waters 190
## 3: 9200149429834456   Karla Schmidt  NA
# Import the file using na.strings
missing_values <- fread("./RInputFiles/_sample.csv", na.strings="##")
## Warning in fread("./RInputFiles/_sample.csv", na.strings = "##"): Discarded
## single-line footer: <<9200149429834456 , Karla Schmidt>>
missing_values
##                  id            name val
## 1: 9002019291929192 Robert Whitaker  NA
## 2: 9200129401349301    Elisa Waters 190
dt <- data.table(id=c(29192L, 49301L, 34456L), 
                 name=c("Robert, Whitaker", "Elisa, Waters", "Karla, Schmidt"), 
                 vals=list(c(144, 48, 32), c(22, 289), 458)
                 )
dt
##       id             name        vals
## 1: 29192 Robert, Whitaker 144, 48, 32
## 2: 49301    Elisa, Waters      22,289
## 3: 34456   Karla, Schmidt         458
# Write dt to fwrite.txt
fwrite(dt, "./RInputFiles/_fwrite.txt")

# Import the file using readLines()
readLines("./RInputFiles/_fwrite.txt")
## [1] "id,name,vals"                        
## [2] "29192,\"Robert, Whitaker\",144|48|32"
## [3] "49301,\"Elisa, Waters\",22|289"      
## [4] "34456,\"Karla, Schmidt\",458"
# Import the file using fread()
fread("./RInputFiles/_fwrite.txt")
##       id             name      vals
## 1: 29192 Robert, Whitaker 144|48|32
## 2: 49301    Elisa, Waters    22|289
## 3: 34456   Karla, Schmidt       458
batrips_dates <- batrips[1:5, c("start_date", "end_date")]
batrips_dates
##             start_date            end_date
## 1: 2014-01-01 00:14:00 2014-01-01 00:21:00
## 2: 2014-01-01 00:14:00 2014-01-01 00:21:00
## 3: 2014-01-01 00:17:00 2014-01-01 00:42:00
## 4: 2014-01-01 00:23:00 2014-01-01 00:50:00
## 5: 2014-01-01 00:23:00 2014-01-01 00:50:00
# Write batrips_dates to file using "ISO" format
fwrite(batrips_dates, "./RInputFiles/_iso.txt", dateTimeAs="ISO")

# Import the file back
iso <- fread("./RInputFiles/_iso.txt")
iso
##              start_date             end_date
## 1: 2014-01-01T08:14:00Z 2014-01-01T08:21:00Z
## 2: 2014-01-01T08:14:00Z 2014-01-01T08:21:00Z
## 3: 2014-01-01T08:17:00Z 2014-01-01T08:42:00Z
## 4: 2014-01-01T08:23:00Z 2014-01-01T08:50:00Z
## 5: 2014-01-01T08:23:00Z 2014-01-01T08:50:00Z
# Write batrips_dates to file using "squash" format
fwrite(batrips_dates, "./RInputFiles/_squash.txt", dateTimeAs="squash")

# Import the file back
squash <- fread("./RInputFiles/_squash.txt")
squash
##           start_date          end_date
## 1: 20140101081400000 20140101082100000
## 2: 20140101081400000 20140101082100000
## 3: 20140101081700000 20140101084200000
## 4: 20140101082300000 20140101085000000
## 5: 20140101082300000 20140101085000000
# Write batrips_dates to file using "epoch" format
fwrite(batrips_dates, "./RInputFiles/_epoch.txt", dateTimeAs="epoch")

# Import the file back
epoch <- fread("./RInputFiles/_epoch.txt")
epoch
##    start_date   end_date
## 1: 1388564040 1388564460
## 2: 1388564040 1388564460
## 3: 1388564220 1388565720
## 4: 1388564580 1388566200
## 5: 1388564580 1388566200
# Use write.table() to write batrips
system.time(write.table(batrips, "./RInputFiles/_base-r.txt"))
##    user  system elapsed 
##   12.51    0.45   13.59
# Use fwrite() to write batrips
system.time(fwrite(batrips, "./RInputFiles/_data-table.txt"))
##    user  system elapsed 
##    0.24    0.03    0.15

Probability Puzzles in R

Chapter 1 - Introduction and Classic Puzzles

Introduction:

  • Chapter 1 - Classic Problems
  • Chapter 2 - Dice Puzzles
  • Chapter 3 - Web Puzzles
  • Chapter 4 - Poker Games
  • Built-in combinatorics functions will help
    • factorial(3)
    • choose(5,3)
  • Can run simulations using sample(), rbinom(), replicate(), for, while, set.seed() and the like
    • for(i in 1:10){ sum(sample(x = c(1,2,3,4,5,6), size = 2, replace = TRUE)) }
    • rolls <- rep(NA, 10)
    • for(i in 1:10){ rolls[i] <- sum(sample(x = c(1,2,3,4,5,6), size = 2, replace = TRUE)) }

Birthday Problem:

  • Suppose that there are n people in the room, and we want to know the probability that 2+ people share a birthday
    • Ignore February 29th
    • Birthdays are uniformly distributed across the remaining 365 days
    • Each individual in the room is independent
  • Can run a simulation-based approach to the problem
  • There is a built-in function pbirthday() that calculates the exact probability
    • pbirthday(10)
    • room_sizes <- c(1:10)
    • match_probs <- sapply(room_sizes, pbirthday)
    • plot(match_probs ~ room_sizes)

Monty Hall:

  • Three door problem - one with a prize, and two with nothing
    • Contestant picks a door
    • Host opens a door with nothing
    • Contestant has the choice to switch or not switch
  • Can manage this problem in R with reverse indexing
    • doors <- 1:3
    • reveal <- doors[-c(1,2)] # assumes contestant chose 1 and actual prize is in 2
    • reveal <- sample(x = doors[-1], size = 1) # assumes contestant chose 1 and actual prize is in 1

Example code includes:

# Set seed to 1
set.seed(1)


# Write a function to roll k dice
roll_dice <- function(k){
    all_rolls <- sample(c(1,2,3,4,5,6), k, replace = TRUE)
    final_answer <- sum(all_rolls)
    return(final_answer)
}

# Run the function to roll five dice
roll_dice(5)
## [1] 22
# Initialize a vector to store the output
output <- rep(NA, 10000)

# Loop for 10000 iterations
for(i in 1:10000){
    # Fill in the output vector with the result from rolling two dice
    output[i] <- roll_dice(2)
}


set.seed(1)
n <- 50
match <- 0

# Simulate 10000 rooms and check for matches in each room
for(i in 1:10000){
    birthdays <- sample(1:365, n, replace = TRUE)
    if(length(unique(birthdays)) < n){ match <- match + 1 } 
}

# Calculate the estimated probability of a match and print it
p_match <- match/10000
print(p_match)
## [1] 0.971
# Calculate the probability of a match for a room size of 50
pbirthday(50)
## [1] 0.97
# Define the vector of sample sizes
room_sizes <- 1:50

# Run the pbirthday function within sapply on the vector of sample sizes
match_probs <- sapply(room_sizes, FUN=pbirthday)

# Create the plot
plot(match_probs ~ room_sizes)

set.seed(1)
doors <- c(1,2,3)

# Randomly select one of the doors to have the prize
prize <- sample(x = doors, size = 1)
initial_choice <- 1

# Check if the initial choice equals the prize
if(prize == initial_choice){
    print("The initial choice was correct!")
}

print(prize)
## [1] 2
set.seed(1)
doors <- c(1,2,3)

# Define counter
win_count <- 0

# Run 10000 iterations of the game
for(i in 1:10000){
    prize <- sample(x = doors, size = 1)
    initial_choice <- 1
    if(initial_choice == prize){ win_count <- win_count + 1 }
}

# Print the answer
print(win_count / 10000)
## [1] 0.336
reveal_door <- function(doors, prize, initial_choice){
    if(prize == initial_choice){
        # Sample at random from the two remaining doors
        reveal <- sample(doors[-prize], 1)
    } else {
        reveal <- doors[-c(prize, initial_choice)]
    }  
}

set.seed(1)
prize <- sample(doors,1)
initial_choice <- 1

# Use the reveal_door function to do the reveal
reveal <- reveal_door(doors, prize, initial_choice)

# Switch to the remaining door
final_choice <- doors[-c(initial_choice, reveal)]
print(final_choice)
## [1] 2
# Check whether the final choice equals the prize
if(final_choice==prize){
    print("The final choice is correct!")
}
## [1] "The final choice is correct!"
# Initialize the win counter
win_count <- 0

for(i in 1:10000){
    prize <- sample(doors,1)
    initial_choice <- 1
    reveal <- reveal_door(doors, prize, initial_choice)
    final_choice <- doors[-c(initial_choice, reveal)]
    if(final_choice == prize){
        # Increment the win counter
        win_count <- win_count + 1
    }
}

# Print the estimated probability of winning
print(win_count / 10000)
## [1] 0.666

Chapter 2 - Games with Dice

Yahtzee:

  • Yahtzee is based on rolling 5 dice, and then optionally re-rolling 0-5 of the dice for two more turns
  • When rolling 3 dice, there are 6**3 possible permutations of the dice
  • Can use factorial functions for combinatorics
    • factorial(3)
  • Probability of rolling exactly {3, 4, 5} or {2, 3, 4} - probabilities can be added since they are MECE
    • factorial(3)/6^3 + factorial(3)/6^3
  • The probability of rolling 3-dice Yahtzee
    • 1/6^3 + 1/6^3 + 1/6^3 + 1/6^3 + 1/6^3 + 1/6^3
  • Can also use the choose() function for combinatorics in R
    • choose(3,2)
  • Number of ways to roll 5 of one denomination and 5 of another denomination
    • n_denom <- factorial(6) / factorial(4)
    • n_groupings <- choose(10,5) * choose(5,5)
    • n_total <- n_denom * n_groupings

Settlers of Catan:

  • A typical game lasts ~60 rolls, and each board spot is labeled 2-12
  • Can simulate dice rolls
    • roll_dice <- function(k){
    • all_rolls <- sample(c(1,2,3,4,5,6), k, replace = TRUE)
    • final_answer <- sum(all_rolls)
    • }
    • rolls <- replicate(10, roll_dice(2))
    • table(rolls)
    • sum(rolls == 3)

Craps:

  • Pass line bet made prior to the start of a roll
    • 7 or 11 on the first roll wins
    • 2, 3, or 12 on the first roll loses
    • All others become the point, which then needs to be rolled before the next 7 to be a winner
  • There is no set number of rolls, so replicate() and for() do not work as in the previous examples
    • roll <- 0; while(roll != 6){ roll <- roll_dice(1); print(roll) }

Example code includes:

# Calculate the size of the sample space
s_space <- 6**5

# Calculate the probability of a Yahtzee
p_yahtzee <- 6 / s_space

# Print the answer
print(p_yahtzee)
## [1] 0.000772
s_space <- 6^5

# Calculate the probabilities
p_12345 <- factorial(5) / s_space
p_23456 <- factorial(5) / s_space
p_large_straight <- p_12345 + p_23456

# Print the large straight probability
print(p_large_straight)
## [1] 0.0309
s_space <- 6^5

# Calculate the number of denominations possible
n_denom <- factorial(6) / factorial(4)

# Calculate the number of ways to form the groups
n_groupings <- choose(5, 3)

# Calculate the total number of full houses
n_full_house <- n_denom * n_groupings

# Calculate and print the answer
print(n_full_house / s_space)
## [1] 0.0386
set.seed(1)

# Simulate one game (60 rolls) and store the result
rolls <- replicate(60, roll_dice(2))

# Display the result
table(rolls)
## rolls
##  2  4  5  6  7  8  9 10 11 12 
##  3  5  7  7 13  9  9  3  3  1
set.seed(1)
counter <- 0

for(i in 1:10000){
    # Roll two dice 60 times
    rolls <- replicate(60, roll_dice(2))
    # Check whether 2 or 12 was rolled more than twice
    if(sum(rolls==2) > 2 | sum(rolls==12) > 2) { counter <- counter + 1 }
}

# Print the answer
print(counter/10000)
## [1] 0.414
roll_after_point <- function(point){
    new_roll <- 0
    # Roll until either a 7 or the point is rolled 
    while( (new_roll != point) & (new_roll != 7) ){
        new_roll <- roll_dice(2)
        if(new_roll == 7){ won <- FALSE }
        # Check whether the new roll gives a win
        if(new_roll == point){ won <- TRUE }
    }
    return(won)
}


evaluate_first_roll <- function(roll){
    # Check whether the first roll gives an immediate win
    if(roll %in% c(7, 11)){ won <- TRUE }
    # Check whether the first roll gives an immediate loss
    if(roll %in% c(2, 3, 12)){ won <- FALSE }
    if(roll %in% c(4,5,6,8,9,10) ){
        # Roll until the point or a 7 is rolled and store the win/lose outcome
        won <- roll_after_point(roll)
    }
    return(won)
}


set.seed(1)
won <- rep(NA, 10000)

for(i in 1:10000){
    # Shooter's first roll
    roll <- roll_dice(2)
    # Determine result and store it
    won[i] <- evaluate_first_roll(roll)
}

sum(won)/10000
## [1] 0.494

Chapter 3 - Inspired from the Web

Factoring a Quadratic:

  • Given random integers a, b, c, what is the probability that the quadratic will factor?
    • The definition of “factorable” is that the solution is rational (can be expressed as the ratio of two integers; is not imaginary)
  • The solution is rational only when the discriminant (b^2 - 4ac) is a perfect square
    • sqrt_dscr <- sqrt(3^2 - 4*1*2)
    • sqrt_dscr == round(sqrt_dscr) # basically, is it an integer; note that is.integer() would return FALSE here, since the result is stored as a double

Four Digit iPhone Passcodes:

  • Smudge marks on an iPhone can leave clues as to the passcode
  • Research suggests that using a single repeated digit can ENHANCE the security of the passcode
  • The identical function checks whether the full vector is equivalent, as opposed to the element-wise ==
    • identical(set1, set2)
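A quick base-R illustration of why identical() (whole-object comparison) is used here rather than == (element-wise comparison); the vectors are hypothetical:

passcode <- c(3, 4, 5, 9)
guess <- c(3, 5, 4, 9)

passcode == guess                  # element-wise: TRUE FALSE FALSE TRUE
identical(passcode, guess)         # single FALSE: the vectors differ
identical(passcode, c(3, 4, 5, 9)) # single TRUE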

Sign Error Cancellations:

  • Suppose the probability of a sign flip is p < 0.5, and that an even number of sign flips yields a correct answer. Is the student then guaranteed a greater than 50% chance of getting the right answer?
    • sapply(X, FUN, …, simplify = TRUE, USE.NAMES = TRUE)
    • rbinom(n, size, prob)
    • result <- sapply(X = c(0.25, 0.75, 0.1, 0.9), FUN = rbinom, n = 1, size = 1)

Example code includes:

is_factorable <- function(a,b,c){
    # Check whether solutions are imaginary
    if(b^2 - 4*a*c < 0){ 
        return(FALSE)
        # Designate when the next section should run
    } else {
        sqrt_discriminant <- sqrt(b^2 - 4*a*c) 
        # return TRUE if quadratic is factorable
        return(sqrt_discriminant == round(sqrt_discriminant))
    }
}

counter <- 0

# Nested for loop
for(a in 1:100){
    for(b in 1:100){
        for(c in 1:100){
            # Check whether factorable
            if(is_factorable(a, b, c)){ counter <- counter + 1 }
        }
    }
}

print(counter / 100^3)
## [1] 0.0164
counter <- 0

# Store known values 
values <- c(3, 4, 5, 9)
passcode = values

for(i in 1:10000){
    # Create the guess
    guess <- sample(values, replace=FALSE)
    # Check condition 
    if(identical(passcode, guess)){ counter <- counter + 1 }
}

print(counter/10000)
## [1] 0.0445
counter <- 0
# Store known values
unique_values <- c(2, 4, 7)
passcode = c(unique_values, unique_values[1])

for(i in 1:10000){
    # Pick repeated value
    all_values <- c(unique_values, sample(unique_values, 1))
    # Make guess
    guess <- sample(all_values, replace=FALSE)
    if(identical(passcode, guess)){ counter <- counter + 1 }
}

print(counter / 10000)
## [1] 0.026
set.seed(1)

# Run 10000 iterations, 0.1 sign switch probability
switch_a <- rbinom(10000, 3, prob=0.1)

# Calculate probability of correct answer
mean(switch_a/2==round(switch_a/2))
## [1] 0.764
# Run 10000 iterations, 0.45 sign switch probability
switch_b <- rbinom(10000, 3, prob=0.45)

# Calculate probability of correct answer
mean(switch_b/2==round(switch_b/2))
## [1] 0.508
set.seed(1)
counter <- 0

for(i in 1:10000){
    # Simulate switches
    each_switch <- sapply(c(0.49, 0.1), FUN=rbinom, size=1, n=1)
    # Simulate switches
    num_switches <- sum(each_switch)
    # Check solution
    if(num_switches/2 == round(num_switches/2)){ counter <- counter + 1 }
}

print(counter/10000)
## [1] 0.5

Chapter 4 - Poker

Texas Hold’em:

  • Texas Hold’em is a variant of poker where each player has 2 personal cards and 5 communal cards, with rounds of betting to start and then as the communal cards are shown
  • Can look at probabilities when there are 2 cards left to go or one card left to go
  • Can look at varying numbers of “outs” (winning cards)
    • outs <- c(0,1,2,3)
    • p_lose <- choose(10-outs,2) / choose(10,2)
    • p_win <- 1 - p_lose
  • Can further calculate the expected value of the possible outcomes (sum of profit*probability across all possible outcomes)

Consecutive Cashes:

  • WSOP Main Event held in Las Vegas - survival time is all that matters (not number of chips left at any given time)
    • Roughly the top 10% of players “cash” (win prize money)
    • Ronnie Bardah cashed five straight years - how likely is that?
  • Simplifying assumptions include
    • 6000 players, same by year, all of same talent
    • cash_year1 <- sample(players, 4)
    • cash_year2 <- sample(players, 4)
    • intersect(cash_year1, cash_year2)
  • Challenge for this lesson is to get the intersection of multiple years; can use a matrix for this
    • cashes <- replicate(3, sample(players, 4))
    • in_all_three <- Reduce(intersect, list(cashes[, 1], cashes[, 2], cashes[, 3])) # input to Reduce() must be a list
    • length(in_all_three)

von Neumann Model of Poker:

  • Application of game theory to poker - model stipulates each hand is a random draw from Uniform(0, 1)
    • runif(n, min = 0, max = 1)
    • runif(n = 1)
  • The model assumes the runif() draws are compared only if both players wager; otherwise, the wagering player beats the non-wagering player
  • Can use the ifelse() function for modeling

Wrap Up:

  • Combinatorics - choose(), factorial()
  • Simulation - sample(), replicate(), runif()
  • Can continue with more complex combinatorics puzzles
  • Can transfer directly to Monte Carlo and Markov chains

Example code includes:

p_win <- 8 / 46
curr_pot <- 50
bet <- 10

# Define vector of probabilities
probs <- c(p_win, 1-p_win)

# Define vector of values
values <- c(curr_pot, -bet)

# Calculate expected value
sum(probs*values)
## [1] 0.435
outs <- c(0:25)

# Calculate probability of not winning
p_no_outs <- choose(47-outs, 2) /choose(47, 2)

# Calculate probability of winning
p_win <- 1 - p_no_outs

print(p_win)
##  [1] 0.0000 0.0426 0.0842 0.1249 0.1647 0.2035 0.2414 0.2784 0.3145 0.3497
## [11] 0.3839 0.4172 0.4496 0.4810 0.5116 0.5412 0.5698 0.5976 0.6244 0.6503
## [21] 0.6753 0.6994 0.7225 0.7447 0.7660 0.7863
players <- c(1:60)
count <- 0

for(i in 1:10000){
    cash_year1 <- sample(players, 6)
    cash_year2 <- sample(players, 6)
    # Find those who cashed both years
    cash_both <- intersect(cash_year1, cash_year2)
    # Check whether anyone cashed both years
    if(length(cash_both) > 0){ count <- count + 1 }
}

print(count/10000)
## [1] 0.49
check_for_five <- function(cashed){
    # Find intersection of five years
    all_five <- Reduce(intersect, list(cashed[, 1], cashed[, 2], cashed[, 3], cashed[, 4], cashed[, 5]))
    # Check intersection
    if(length(all_five) > 0){ 
        return(TRUE)
        # Specify when to return FALSE
    } else { return(FALSE) }
}


players <- c(1:6000)
count <- 0

for(i in 1:10000){
    # Create matrix of cashing players
    cashes <- replicate(5, sample(players, 600, replace=FALSE))
    # Check for five time winners
    if(check_for_five(cashes)){ count <- count + 1 }
}

print(count/10000)
## [1] 0.0604
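A rough analytic check (my addition): treating players as independent, each player cashes in a given year with probability 600/6000 = 0.1, so cashes all five years with probability 0.1^5. The chance that at least one of the 6,000 players does so is then approximately:

```r
p_each <- (600 / 6000)^5          # one player cashes all five years
p_any  <- 1 - (1 - p_each)^6000   # at least one of 6,000 players does
p_any                             # roughly 0.058
```

This ignores the dependence created by exactly 600 players cashing each year, which is why it sits slightly below the simulated 0.0604.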
# Generate values for both players
A <- runif(1)
B <- runif(1)

# Check winner
if(A > B){
    print("Player A wins")
} else {
    print("Player B wins")
}
## [1] "Player A wins"
print(A)
## [1] 0.706
print(B)
## [1] 0.0876
one_round <- function(bet_cutoff){
    a <- runif(n = 1)
    b <- runif(n = 1)
    # Fill in betting condition
    if(b > bet_cutoff){
        # Return result of bet
        return(ifelse(b > a, 1, -1))
    } else {
        return(0)
    }  
}


b_win <- rep(NA, 10000)

for(i in 1:10000){
    # Run one and store result
    b_win[i] <- one_round(0.5)
}

# Print expected value
mean(b_win)
## [1] 0.241
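The simulated value can also be derived analytically (my addition): with cutoff 0.5, B bets half the time; conditional on betting, b is uniform on (0.5, 1), so P(b > a) = E[b] = 0.75 since a ~ Uniform(0, 1).

```r
# Expected value of one round with cutoff 0.5:
# P(bet) * (P(win) * 1 + P(lose) * (-1)) + P(no bet) * 0
p_bet <- 0.5
p_win <- 0.75   # P(b > a | b > 0.5) = E[b | b > 0.5] = 0.75
ev    <- p_bet * (p_win * 1 + (1 - p_win) * -1)
ev    # 0.25, close to the simulated 0.241
```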

Highcharter for Finance in R

Chapter 1 - Introduction to Highcharter

Introduction:

  • Highcharts builds strong visualizations - requires a license for commercial use
  • The highcharter package on CRAN wraps the Highcharts JavaScript library
    • Also called an htmlwidget
    • Extends packages like ggplot2 by enabling zooming, hovering, etc.
  • Can make an OHLC chart (assuming appropriate underlying data) simply
    • hchart(spy_prices)
    • hchart(spy_prices$open, type = "line", color = "purple")

Two highcharter paradigms:

  • Can use either highchart() to draw a blank canvas or hchart(object) to plot the object while creating the highchart()
    • highchart(type = "stock") %>% hc_add_series(spy_prices) # spy_prices is formatted as xts data
    • hchart(spy_prices) # will guess the best type based on the data (if not specified)
    • hchart(spy_prices_tibble, hcaes(x = date, y = open), type = "line") # hcaes() is the hchart() equivalent of aes()

Data going forward:

  • The xts data holds the full stock data for the portfolio, indexed by date (all xts objects must have an index)
    • etf_prices_xts
    • etf_prices_xts$SPY # will include the index (date)
    • index(etf_prices_xts) # will show the index
  • Can also store data as a wide tibble
    • etf_prices_wide_tibble # no index, date is an explicit column
    • etf_prices_wide_tibble$SPY # date is not shown since it is not selected
  • Can also store data as a tidy (long) tibble
    • etf_prices_tidy_tibble
    • etf_prices_tidy_tibble %>% filter(symbol == "SPY") # grab just the SPY data

Example code includes:

load("./RInputFiles/stock_prices_xts.RData")
load("./RInputFiles/stock_tidy_tibble_prices.RData")
load("./RInputFiles/stock_wide_tibble_returns.RData")

str(stock_prices_xts)
## An 'xts' object on 2012-12-31/2017-12-29 containing:
##   Data: num [1:1260, 1:5] 251 257 258 259 268 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:5] "AMZN" "JPM" "DIS" "GOOG" ...
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
## List of 2
##  $ src    : chr "yahoo"
##  $ updated: POSIXct[1:1], format: "2018-12-15 16:31:23"
str(stock_tidy_tibble_prices)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6300 obs. of  3 variables:
##  $ date  : Date, format: "2012-12-31" "2013-01-02" ...
##  $ symbol: chr  "AMZN" "AMZN" "AMZN" "AMZN" ...
##  $ price : num  251 257 258 259 268 ...
str(stock_wide_tibble_returns)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1260 obs. of  6 variables:
##  $ date: Date, format: "2012-12-31" "2013-01-02" ...
##  $ AMZN: num  NA 0.02535 0.00454 0.00259 0.03529 ...
##  $ DIS : num  NA 0.02597 0.00215 0.01896 -0.02365 ...
##  $ GOOG: num  NA 0.022187 0.000581 0.019568 -0.004373 ...
##  $ JPM : num  NA 0.02242 -0.00202 0.01757 0.0011 ...
##  $ KO  : num  NA 0.03656 0 0.00159 -0.0096 ...
load("./RInputFiles/commodities_returns.RData")
load("./RInputFiles/commodities-returns-tidy.RData")
load("./RInputFiles/commodities-xts.RData")

str(commodities_returns)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  4 variables:
##  $ date     : Date, format: "2017-01-04" "2017-01-05" ...
##  $ gold     : num  0.002836 0.013637 -0.00671 0.009753 0.000506 ...
##  $ platinum : num  0.05683 0.00825 0.019 0.016 -0.00132 ...
##  $ palladium: num  0.03897 0.01056 0.01667 0.00309 0 ...
##  - attr(*, "na.action")= 'omit' Named int  1 10 11 35 36 37 75 76 85 86 ...
##   ..- attr(*, "names")= chr  "1" "10" "11" "35" ...
str(commodities_returns_tidy)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  771 obs. of  3 variables:
##  $ date  : Date, format: "2017-01-03" "2017-01-04" ...
##  $ metal : chr  "gold" "gold" "gold" "gold" ...
##  $ return: num  NA 0.00284 0.01364 -0.00671 0.00975 ...
##  - attr(*, "vars")= chr "metal"
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 3
##   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
##   ..$ : int  514 515 516 517 518 519 520 521 522 523 ...
##   ..$ : int  257 258 259 260 261 262 263 264 265 266 ...
##  - attr(*, "group_sizes")= int  257 257 257
##  - attr(*, "biggest_group_size")= int 257
##  - attr(*, "labels")='data.frame':   3 obs. of  1 variable:
##   ..$ metal: chr  "gold" "palladium" "platinum"
##   ..- attr(*, "vars")= chr "metal"
##   ..- attr(*, "drop")= logi TRUE
str(commodities_xts)
## An 'xts' object on 2017-01-03/2017-12-29 containing:
##   Data: num [1:257, 1:6] 1162 1165 1181 1173 1185 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:6] "gold" "copper" "oil" "silver" ...
##   Indexed by objects of class: [Date] TZ: UTC
##   xts Attributes:  
##  NULL
# Load the highcharter package
library(highcharter)
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(zoo)


quantmod::getSymbols("XLK", src="yahoo", from="2012-12-31", to="2017-12-31")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "XLK"
# Convert column names like "XLK.Open" to lowercase short names like "open"
fix_vars <- function(x) { tolower(str_split(x, fixed("."))[[1]][2]) }
xlk_prices <- XLK
names(xlk_prices) <- sapply(names(XLK), FUN = fix_vars) %>% unname()


# Build a candlestick chart
hchart(xlk_prices, type = "candlestick")
# Build a ohlc chart
hchart(xlk_prices, type = "ohlc")
# Build a line chart
hchart(xlk_prices$close, type = "line")
# Show the dates
head(index(xlk_prices))
## [1] "2012-12-31" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07"
## [6] "2013-01-08"
# Use the base function and set the correct chart type
highchart(type = "stock") %>%
    hc_add_series(xlk_prices)
xlk_prices_tibble <- fortify.zoo(xlk_prices) %>%
    as_tibble() %>%
    rename("date"="Index")
head(xlk_prices_tibble)
## # A tibble: 6 x 7
##   date        open  high   low close   volume adjusted
##   <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 2012-12-31  28.3  29.0  28.2  28.8 15186300     25.8
## 2 2013-01-02  29.6  29.8  29.5  29.8 15041500     26.7
## 3 2013-01-03  29.8  29.9  29.5  29.6  9789700     26.5
## 4 2013-01-04  29.6  29.6  29.4  29.5  6832900     26.4
## 5 2013-01-07  29.4  29.5  29.3  29.5  7688000     26.4
## 6 2013-01-08  29.5  29.5  29.2  29.4  5565200     26.3
# Create a line chart of the 'close' prices
hchart(xlk_prices_tibble, hcaes(x = date, y = close), type = "line")
# Create a line chart of the open prices
hchart(xlk_prices_tibble, hcaes(x = date, y = open), type = "line")
# Inspect the first rows of the xts data object
head(stock_prices_xts)
##            AMZN  JPM  DIS GOOG   KO
## 2012-12-31  251 37.4 45.8  351 30.0
## 2013-01-02  257 38.3 47.0  359 31.1
## 2013-01-03  258 38.2 47.1  359 31.1
## 2013-01-04  259 38.9 48.0  367 31.2
## 2013-01-07  268 38.9 46.9  365 30.9
## 2013-01-08  266 39.0 46.7  364 30.6
# Extract and show the GOOG column from the xts object
head(stock_prices_xts$GOOG)
##            GOOG
## 2012-12-31  351
## 2013-01-02  359
## 2013-01-03  359
## 2013-01-04  367
## 2013-01-07  365
## 2013-01-08  364
# Display the date index from the xts object
head(index(stock_prices_xts))
## [1] "2012-12-31" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07"
## [6] "2013-01-08"
# Extract and show the DIS column from the xts object
head(stock_prices_xts$DIS)
##             DIS
## 2012-12-31 45.8
## 2013-01-02 47.0
## 2013-01-03 47.1
## 2013-01-04 48.0
## 2013-01-07 46.9
## 2013-01-08 46.7
stock_wide_tibble_prices <- stock_tidy_tibble_prices %>%
    tidyr::spread(symbol, price)

# Inspect the first rows of the wide tibble object
head(stock_wide_tibble_prices)
## # A tibble: 6 x 6
##   date        AMZN   DIS  GOOG   JPM    KO
##   <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-12-31  251.  45.8  351.  37.4  30.0
## 2 2013-01-02  257.  47.0  359.  38.3  31.1
## 3 2013-01-03  258.  47.1  359.  38.2  31.1
## 4 2013-01-04  259.  48.0  367.  38.9  31.2
## 5 2013-01-07  268.  46.9  365.  38.9  30.9
## 6 2013-01-08  266.  46.7  364.  39.0  30.6
# Extract and show the GOOG column from the wide tibble data
head(stock_wide_tibble_prices$GOOG)
## [1] 351 359 359 367 365 364
# Display the date information from the wide tibble data
head(stock_wide_tibble_prices$date)
## [1] "2012-12-31" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07"
## [6] "2013-01-08"
# Extract and show the DIS column from the wide tibble data
head(stock_wide_tibble_prices$DIS)
## [1] 45.8 47.0 47.1 48.0 46.9 46.7
# Inspect the first rows of the tidy tibble object
head(stock_tidy_tibble_prices)
## # A tibble: 6 x 3
##   date       symbol price
##   <date>     <chr>  <dbl>
## 1 2012-12-31 AMZN    251.
## 2 2013-01-02 AMZN    257.
## 3 2013-01-03 AMZN    258.
## 4 2013-01-04 AMZN    259.
## 5 2013-01-07 AMZN    268.
## 6 2013-01-08 AMZN    266.
# Extract and show the GOOG price data from the tidy tibble data
stock_tidy_tibble_prices %>%
    filter(symbol == "GOOG") %>%
    head()
## # A tibble: 6 x 3
##   date       symbol price
##   <date>     <chr>  <dbl>
## 1 2012-12-31 GOOG    351.
## 2 2013-01-02 GOOG    359.
## 3 2013-01-03 GOOG    359.
## 4 2013-01-04 GOOG    367.
## 5 2013-01-07 GOOG    365.
## 6 2013-01-08 GOOG    364.
# Display the date information from the tidy tibble
head(stock_tidy_tibble_prices$date)
## [1] "2012-12-31" "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07"
## [6] "2013-01-08"
# Extract and show the DIS price data from the tidy tibble data
stock_tidy_tibble_prices %>%
    filter(symbol == "DIS") %>%
    head()
## # A tibble: 6 x 3
##   date       symbol price
##   <date>     <chr>  <dbl>
## 1 2012-12-31 DIS     45.8
## 2 2013-01-02 DIS     47.0
## 3 2013-01-03 DIS     47.1
## 4 2013-01-04 DIS     48.0
## 5 2013-01-07 DIS     46.9
## 6 2013-01-08 DIS     46.7

Chapter 2 - Highcharter for xts data

Chart the price of one stock in an xts object:

  • Dataset has five ETF prices in a single object
    • etf_prices_xts
  • Can create charts for just a single series in the ETF data
    • highchart(type = "stock") # create a blank chart canvas; type = "stock" tells highcharter to look for a date index
    • highchart(type = "stock") %>% hc_add_series(etf_prices_xts$SPY) # include data for series SPY; the date is grabbed automatically from the index
    • highchart(type = "stock") %>% hc_add_series(etf_prices_xts$EEM, color = "green")

Chart the price of many stocks from xts:

  • Can add multiple series using multiple calls to hc_add_series()
    • highchart(type = "stock") %>% hc_add_series(etf_prices_xts$SPY) %>% hc_add_series(etf_prices_xts$IJS)
    • highchart(type = "stock") %>% hc_add_series(etf_prices_xts$SPY, color = "blue") %>% hc_add_series(etf_prices_xts$IJS, color = "red")
    • highchart(type = "stock") %>% hc_add_series(etf_prices_xts$SPY, color = "blue", name = "SPY") %>% hc_add_series(etf_prices_xts$IJS, color = "red", name = "IJS")

Adding a title, subtitle, and axis labels:

  • Good labelling makes charts easier to read and interpret - best practice is to add title, then subtitle, then the rest of the data
    • highchart(type = "stock") %>% hc_title(text = "5 ETFs Price History")
    • highchart(type = "stock") %>% hc_title(text = "5 ETFs Price History") %>% hc_subtitle(text = "daily prices")
    • highchart(type = "stock") %>%
    • hc_title(text = "5 ETFs Price History") %>%
    • hc_subtitle(text = "daily prices") %>%
    • hc_add_series(etf_prices_xts$SPY, color = "blue", name = "SPY") %>%
    • hc_add_series(etf_prices_xts$IJS, color = "red", name = "IJS") %>%
    • hc_add_series(etf_prices_xts$EEM, color = "green", name = "EEM") %>%
    • hc_add_series(etf_prices_xts$EFA, color = "purple", name = "EFA") %>%
    • hc_add_series(etf_prices_xts$AGG, color = "orange", name = "AGG")
  • Can modify axes - labels, positions, number formats, etc. - add hc_yAxis() at the end of the plotting code
    • hc_yAxis(title = list(text = "Prices (USD)"), labels = list(format = "${value}"), opposite = FALSE) # opposite = FALSE puts the y-axis on the left (the default, opposite = TRUE, puts it on the right)

Tooltips and legends:

  • The tooltip is like a magnifying glass for the viewer
    • hc_tooltip(pointFormat = "text in the tooltip")
    • hc_tooltip(pointFormat = "${point.y}") # a '$' followed by the y-value of the point the user is hovering over
    • hc_tooltip(pointFormat = "${point.y: .2f}") # round the hovered value to 2 decimal places, and format with a $ sign in front
    • hc_tooltip(pointFormat = "{point.series.name}: ${point.y: .2f}") # will show the series name followed by the $ price, rounded to 2 digits
  • Can also add legends to the chart - legends are interactive, and clicking them turns series on or off
    • hc_legend(enabled = TRUE)

Example code includes:

# Chart the price of KO
highchart(type = "stock") %>%
    hc_add_series(stock_prices_xts$KO)
# Fill in the complete highchart code flow to chart GOOG in green
highchart(type = "stock") %>%
    hc_add_series(stock_prices_xts$GOOG, color = "green")
# Fill in the complete highchart code flow to chart DIS in purple
highchart(type = "stock") %>%
    hc_add_series(stock_prices_xts$DIS, color = "purple")
highchart(type = "stock") %>% 
    # Add the price of GOOG, colored orange
    hc_add_series(stock_prices_xts$GOOG, color = "orange") %>% 
    # Add the price of DIS, colored black
    hc_add_series(stock_prices_xts$DIS, color = "black")
highchart(type = "stock") %>% 
    # Add the price of KO, colored green
    hc_add_series(stock_prices_xts$KO, color = "green") %>%
    # Add the price of JPM, colored pink
    hc_add_series(stock_prices_xts$JPM, color = "pink")
highchart(type = "stock") %>%
    # Add JPM as a blue line called JP Morgan
    hc_add_series(stock_prices_xts$JPM, color = "blue", name = "JP Morgan") %>%
    # Add KO as a red line called Coke
    hc_add_series(stock_prices_xts$KO, color = "red", name = "Coke") %>%
    # Add GOOG as a green line named Google
    hc_add_series(stock_prices_xts$GOOG, color = "green", name = "Google") %>%
    # Add DIS as a purple line named Disney
    hc_add_series(stock_prices_xts$DIS, color = "purple", name = "Disney")
highchart(type = "stock") %>%
    # Add the stocks to the chart with the correct color and name
    hc_add_series(stock_prices_xts$JPM, color = "blue", name = "jpm") %>%
    hc_add_series(stock_prices_xts$KO, color = "red", name = "coke") %>%
    hc_add_series(stock_prices_xts$GOOG, color = "green", name = "google") %>%
    hc_add_series(stock_prices_xts$DIS, color = "purple", name = "disney") %>%
    hc_add_series(stock_prices_xts$AMZN, color = "black", name = "amazon")
highchart(type = "stock") %>%
    # Supply the text of the title to hc_title()
    hc_title(text = "A history of two stocks") %>%
    # Supply the text of the subtitle to hc_subtitle()
    hc_subtitle(text = "told with lines") %>% 
    hc_add_series(stock_prices_xts$AMZN, color = "blue", name = "AMZN") %>% 
    hc_add_series(stock_prices_xts$DIS, color = "red", name = "DIS")  %>%
    # Supply the text and format of the y-axis
    hc_yAxis(title = list(text = "Prices (USD)"), labels = list(format = "${value}"), opposite = FALSE)
highchart(type = "stock") %>%
    # Add a title
    hc_title(text = "A history of two stocks") %>% 
    # Add a subtitle
    hc_subtitle(text = "told with lines") %>% 
    hc_add_series(stock_prices_xts$AMZN, color = "blue", name = "AMZN") %>% 
    hc_add_series(stock_prices_xts$DIS, color = "red", name = "DIS")  %>%
    # Change the y-axis title
    hc_yAxis(title = list(text = "in $$$s"), labels = list(format = "{value} USD"), opposite = FALSE)
highchart(type = "stock") %>% 
    hc_add_series(stock_prices_xts$AMZN, color = "blue", name = "AMZN") %>% 
    hc_add_series(stock_prices_xts$DIS, color = "red", name = "DIS")  %>%
    # Add the dollar sign and y-values on a new line
    hc_tooltip(pointFormat = "Daily Price:<br> ${point.y}")
highchart(type = "stock") %>% 
    hc_add_series(stock_prices_xts$AMZN, color = "blue", name = "AMZN") %>% 
    hc_add_series(stock_prices_xts$DIS, color = "red", name = "DIS")  %>% 
    hc_add_series(stock_prices_xts$GOOG, color = "green", name = "GOOG") %>%
    # Add stock names and round the price
    hc_tooltip(pointFormat = "{point.series.name}: ${point.y: .2f}") %>%
    # Enable the legend
    hc_legend(enabled = TRUE)
# Choose the type of highchart
highchart(type = "stock") %>%
    # Add gold, platinum and palladium
    hc_add_series(commodities_xts$gold, color = "yellow", name= "Gold") %>% 
    hc_add_series(commodities_xts$platinum, color = "grey", name= "Platinum") %>% 
    hc_add_series(commodities_xts$palladium, color = "blue", name= "Palladium") %>%
    # Customize the pointFormat of the tooltip
    hc_tooltip(pointFormat = "{point.series.name}: ${point.y} ") %>%
    hc_title(text = "Gold, Platinum and Palladium 2017") %>%
    hc_yAxis(labels = list(format = "${value}"))

Chapter 3 - Highcharter for wide tibble data

Visualizing one stock from wide tibble data:

  • The tibble format includes a column for date, since there is no index as in the xts - requires use of hcaes() to map the columns to x and y
    • hchart(etf_prices_wide_tibble, hcaes(x = date, y = SPY), type = "line")
    • hchart(etf_prices_wide_tibble, hcaes(x = date, y = SPY), type = "line", color = "green", name = "SPY")

Visualizing multiple stocks from wide tibble data:

  • Can plot multiple series on the same chart, using hchart() %>% hc_add_series() - can use multiple hc_add_series() calls
    • hchart(etf_prices_wide_tibble, hcaes(x = date, y = SPY), type = "line") %>%
    • hc_add_series(etf_prices_wide_tibble, hcaes(x = date, y = EEM), type = "line")
  • The tooltip will hover only over a specific line, rather than all lines for that time point as with xts plotting
    • hc_tooltip(shared = TRUE) # will display all series tooltips at the same time
    • hc_tooltip(shared = TRUE, pointFormat = "{point.series.name}: ${point.y: .2f}") # format accordingly
    • hc_tooltip(shared = TRUE, pointFormat = "{point.series.name}: ${point.y: .2f}<br>") # <br> is HTML for a line break
  • Can also customize the y-axis labels and formats, as well as adding a legend
    • hc_yAxis(title = list(text = "prices (USD)"), labels = list(format = "${value}"))
    • hc_legend(enabled = TRUE)
  • Each of the hc_add_series() calls needs to also specify inclusion in the legend
    • hc_add_series(etf_prices_wide_tibble, hcaes(x = date, y = EEM), name = "EEM", type = "line", showInLegend = TRUE)

Scatterplots from etf_wide_tibble:

  • The wide tibble is more flexible, specifically in allowing for different chart types including scatter
    • hchart(etf_wide_tibble_returns, hcaes(x = SPY, y = EEM), type = "scatter")
    • hchart(etf_wide_tibble_returns, hcaes(x = SPY, y = EEM), type = "scatter", color = "pink", name = "EEM v. SPY")
  • Can format the tooltips for better readability
    • hchart(etf_wide_tibble_returns, hcaes(x = SPY, y = EEM), type = "scatter", color = "pink", name = "EEM v. SPY") %>% hc_tooltip(pointFormat = "{point.date}<br>EEM: {point.y: .2f}%<br>SPY: {point.x: .2f}%")

Mixing chart types from wide tibble data:

  • Can create multiple chart types on the same plot; for example, a regression line on a scatter plot
    • model <- lm(EEM ~ SPY, data = etf_wide_tibble_returns)
    • slope <- coef(model)[2]
    • hchart(etf_wide_tibble_returns, hcaes(x = SPY, y = EEM), type = "scatter", color = "pink", name = "EEM v. SPY") %>% hc_add_series(etf_wide_tibble_returns, hcaes(x = SPY, y = (SPY * slope)), type = "line", color = "blue", lineWidth = 3)
  • Can further customize titles, axes, tooltips and the like, by piping (%>%) to the next commands
    • hc_title(text = "Scatter plot with regression line") %>%
    • hc_yAxis(title = list(text = "EEM Daily returns (%)"), labels = list(format = "{value}%"), opposite = FALSE) %>%
    • hc_xAxis(title = list(text = "SPY Daily returns (%)"), labels = list(format = "{value}%")) %>%
    • hc_tooltip(pointFormat = "{point.date}<br>EEM: {point.y: .2f}%<br>SPY: {point.x: .2f}%")
  • May want to change the tooltip for the regression line, so that it shows the relevant data for the regression rather than the scatter
    • hc_add_series(etf_wide_tibble_returns, hcaes(x = SPY, y = (SPY * slope)), type = "line", color = "blue", lineWidth = 3, tooltip = list(headerFormat = "", pointFormat = ""))
    • tooltip = list(headerFormat = "regression line", pointFormat = "{point.y: .2f}%")
    • hc_add_series(etf_wide_tibble_returns, hcaes(x = SPY, y = (SPY * coef(model)[2])), type = "line", color = "blue", lineWidth = 3, tooltip = list(headerFormat = "regression line", pointFormat = "{point.y: .2f}%"))
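The lm()/coef() step above can be sketched with synthetic returns (my own illustrative data, not the course's ETF series):

```r
set.seed(42)
# Synthetic daily returns: y moves with x (true slope 0.9) plus noise
x <- rnorm(250, mean = 0, sd = 0.01)
y <- 0.9 * x + rnorm(250, mean = 0, sd = 0.002)

# Fit the regression and extract the slope for the fitted line
model <- lm(y ~ x)
slope <- unname(coef(model)[2])

# The y-values to plot as the "line" series over the scatter
fitted_line <- x * slope
```

With real data, `x` and `y` would be the SPY and EEM return columns, and `fitted_line` is what gets passed to hc_add_series() as the regression overlay.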

Example code includes:

# Visualize DIS as a line chart 
hchart(stock_wide_tibble_prices, hcaes(x = date, y = DIS),  
       type = "line",
       # Specify the name
       name = "DIS", 
       # Specify the color
       color = "orange"
       )
# Create a green line chart of KO   
hchart(stock_wide_tibble_prices, hcaes(x = date, y = KO), type = "line", color = "green", name = "KO")
# Create a black line chart of JPM  
hchart(stock_wide_tibble_prices, hcaes(x = date, y = JPM), type = "line", color = "black", name = "JPM")
# Create a line chart of KO 
hchart(stock_wide_tibble_prices, hcaes(x = date, y = KO), name = "KO", type = "line") %>%
    # Add JPM to the chart
    hc_add_series(stock_wide_tibble_prices, hcaes(x = date, y = JPM), name = "JPM", type = "line") %>%
    # Add DIS to the chart
    hc_add_series(stock_wide_tibble_prices, hcaes(x = date, y = DIS), name = "DIS", type = "line") %>%
    # Add AMZN to the chart
    hc_add_series(stock_wide_tibble_prices, hcaes(x = date, y = AMZN), name = "AMZN", type = "line") %>%
    # Enable a shared tooltip
    hc_tooltip(shared = TRUE)
hchart(stock_wide_tibble_prices, hcaes(x=date, y=KO), name="KO", type="line", showInLegend = TRUE) %>%  
    # Add JPM to the chart and show it in the legend
    hc_add_series(stock_wide_tibble_prices, hcaes(x=date, y=JPM), name="JPM", type="line", showInLegend=TRUE) %>%
    # Add DIS to the chart and show it in the legend
    hc_add_series(stock_wide_tibble_prices, hcaes(x=date, y=DIS), name="DIS", type="line", showInLegend=TRUE) %>%
    # Add a legend to the chart
    hc_legend(enabled = TRUE)
hchart(stock_wide_tibble_prices, hcaes(x = date, y = KO), name = "KO", type = "line") %>%
    # Add JPM to the chart
    hc_add_series(stock_wide_tibble_prices, hcaes(x=date, y=JPM), name = "JPM", type = "line") %>%
    # Enable a shared tooltip
    hc_tooltip(shared = TRUE, pointFormat = "{point.series.name}: ${point.y: .2f}<br>") %>%
    # Change the text of the title of the y-axis
    hc_yAxis(title = list(text = "prices (USD)"))
# Specify a green scatter plot  
hchart(stock_wide_tibble_returns, hcaes(x = GOOG, y = JPM), 
       type = "scatter", color = "green", name = "GOOG v. JPM"
       ) %>%
    # Make the tooltip display the x and y points and percentage sign
    hc_tooltip(pointFormat = "GOOG: {point.x: .2f}% <br>JPM: {point.y: .2f}%")
hchart(stock_wide_tibble_returns, hcaes(x = KO, y = AMZN), type = "scatter", 
       color = "pink",  name = "KO v. AMZN"
       ) %>%
    # Add a custom tooltip format
    hc_tooltip(pointFormat = "{point.date} <br>AMZN: {point.y: .2f}% <br>KO: {point.x: .2f}%")
# Create a scatter plot 
hchart(stock_wide_tibble_returns, hcaes(x = KO, y = GOOG), type = "scatter") %>%
    # Add the slope variable
    hc_add_series(stock_wide_tibble_returns, hcaes(x = KO, y = (KO * 1.15)), type =  "line") %>%
    # Customize the tooltip to show the date, x-, and y-values
    hc_tooltip(pointFormat = "{point.date} <br> GOOG {point.y: .2f}% <br> KO: {point.x: .2f}%")
hchart(stock_wide_tibble_returns, hcaes(x = AMZN, y = DIS), type = "scatter") %>%   
    hc_add_series(stock_wide_tibble_returns, hcaes(x = AMZN, y = (AMZN * .492)), type =  "line",
                  # Add the tooltip argument
                  tooltip = list(
                      # Change the header of the line tooltip
                      headerFormat = "DIS/AMZN linear relationship<br>",
                      # Customize the y value display
                      pointFormat = "{point.y: .2f}%"
                      )
                  ) %>%
    # Customize the scatter tooltip
    hc_tooltip(pointFormat = "{point.date} <br> DIS: {point.y: .2f}% <br> AMZN: {point.x: .2f}%")
# Start the hchart flow for the returns data    
hchart(commodities_returns, type = "scatter", 
       hcaes(x = gold, y = palladium, date = date), color = "pink"
       ) %>%
    # Customize the tooltip
    hc_tooltip(pointFormat = "date: {point.date} <br>palladium: {point.y:.4f} <br>gold: {point.x:.4f} ") %>%
    hc_title(text = "Palladium Versus Gold 2017")

Chapter 4 - Highcharter for tidy tibble data

Tidy data:

  • Financial data tends to be in wide format, though the tidy format is increasingly common in the R world
  • Need to change the mapping approach, with price mapped to the y-axis
    • etf_tidy_tibble_prices %>% filter(symbol == "SPY") %>% hchart(., hcaes(x = date, y = price), type = "line") # the . is how hchart() takes in data from the previous step
    • etf_tidy_tibble_prices %>% filter(symbol == "EEM") %>% hchart(., hcaes(x = date, y = price), type = "line", color = "green")

Chart many ETFs from a tidy tibble:

  • Tidy format allows for easy plotting of all series
    • etf_tidy_tibble_prices %>% hchart(., hcaes(x = date, y = price, group = symbol), type = "line")
    • etf_tidy_tibble_prices %>% filter(symbol != "AGG") %>% hchart(., hcaes(x = date, y = price, group = symbol), type = "line")
  • Can also customize axes
    • etf_tidy_tibble_prices %>% filter(symbol != "AGG" & symbol != "EFA") %>% hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>% hc_title(text = "Tidy Line Charts") %>% hc_yAxis(title = list(text = "Prices (USD)"), labels = list(format = "${value}"), opposite = FALSE)

Creativity with tidy data:

  • Tidy format allows for easy manipulation and transformation of the underlying data
    • etf_tidy_returns_tibble %>% summarize(mean = mean(returns), std_dev = sd(returns))
    • etf_tidy_returns_tibble %>% summarize(mean = mean(returns), std_dev = sd(returns)) %>% hchart(., hcaes(x = symbol, y = mean, group = symbol), type = "scatter")
    • etf_tidy_returns_tibble %>% summarize(mean = mean(returns), std_dev = sd(returns)) %>% hchart(., hcaes(x = symbol, y = mean, group = symbol, size = std_dev), type = "scatter")
  • Can further display the return-to-risk ratio
    • etf_tidy_returns_tibble %>% summarize(mean = mean(returns), std_dev = sd(returns), return_risk = mean/std_dev) %>% hchart(., hcaes(x = symbol, y = return_risk, group = symbol), type = "column")
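The summarize step reduces each symbol's return series to a single ratio; per symbol, the underlying computation is just base R (illustrative numbers below, not the ETF data):

```r
set.seed(7)
# One symbol's synthetic daily returns for a trading year
returns <- rnorm(252, mean = 0.0004, sd = 0.01)

# Mean return per unit of volatility (the quantity mapped to the y-axis)
return_risk <- mean(returns) / sd(returns)
```

summarize() simply applies this calculation within each group of the symbol-grouped tibble.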

Tidy tooltips:

  • Can customize the tooltips similar to what was done with xts or wide data
    • etf_tidy_prices_tibble %>% hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>% hc_tooltip(pointFormat = "${point.price: .2f}")
    • etf_tidy_prices_tibble %>% hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>% hc_tooltip(pointFormat = "{point.symbol}: ${point.price: .2f}<br>", shared = TRUE)
  • Can change labels using mutate()
    • etf_tidy_prices_tibble %>%
    • mutate(type = case_when(symbol == "EFA" ~ "international", symbol == "EEM" ~ "emerging", symbol == "AGG" ~ "bond", symbol == "IJS" ~ "small-cap", symbol == "SPY" ~ "market"))
    • etf_tidy_prices_tibble %>%
    • mutate(type = case_when(symbol == "EFA" ~ "international", symbol == "EEM" ~ "emerging", symbol == "AGG" ~ "bond", symbol == "IJS" ~ "small-cap", symbol == "SPY" ~ "market")) %>%
    • hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>%
    • hc_tooltip(pointFormat = "{point.symbol}: ${point.price: .2f}<br>fund type: {point.type}")
  • Can further run this process using summary stats (underlying tibble is grouped by symbol)
    • etf_tidy_returns_tibble %>% summarize(mean = mean(returns), st_dev = sd(returns), max_return = max(returns), min_return = min(returns))
    • etf_tidy_returns_tibble %>%
    • summarize(mean = mean(returns), st_dev = sd(returns), max_return = max(returns), min_return = min(returns)) %>%
    • hchart(., hcaes(x = symbol, y = mean, group = symbol), type = "column") %>%
    • hc_tooltip(pointFormat = "sd: {point.st_dev: .4f}%<br>max: {point.max_return: .4f}%<br>min: {point.min_return: .4f}%")

Wrap up:

  • Tibbles and xts objects for highcharter - wide tibbles, tidy tibbles, xts
    • highchart(type = "stock")
    • hchart()
  • Customized tooltips

Example code includes:

stock_tidy_tibble_prices %>% 
    # Filter by the symbol
    filter(symbol == "KO") %>%
    # Pass the data, choose the mappings and create a line chart
    hchart(., hcaes(x = date, y = price), type = "line", color = "red")
stock_tidy_tibble_prices %>% 
    # Filter the data by symbol
    filter(symbol == "GOOG") %>%
    # Pass the data
    hchart(., hcaes(x = date, y = price), type = "line", color = "purple")
# Chart AMZN as a black line
stock_tidy_tibble_prices %>% 
    filter(symbol == "AMZN") %>%
    hchart(., hcaes(x = date, y = price), type = "line", color = "black")
stock_tidy_tibble_prices %>%
    # Pass in the data
    hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>%
    # Title the chart
    hc_title(text = "Daily Prices from Tidy Tibble") %>% 
    # Customize the y-axis and move the labels to the left
    hc_yAxis(title = list(text = "Prices (USD)"), labels = list(format = "${value}"), opposite = FALSE)
stock_tidy_tibble_prices %>%
    # Filter the data so it doesn't include JP Morgan
    filter(symbol != "JPM") %>%
    # Pass in the data and define the aesthetic mappings
    hchart(., hcaes(x = date, y = price, group = symbol), type = "line")
stock_tidy_tibble_prices %>%
    # Filter the data so it doesn't include Disney and Coke
    filter(!(symbol %in% c("DIS", "KO"))) %>%
    # Pass in the data and define the aesthetic mappings
    hchart(., hcaes(x = date, y = price, group = symbol), type = "line")
stock_tidy_tibble_returns <- stock_tidy_tibble_prices %>%
    arrange(symbol, date) %>%
    group_by(symbol) %>%
    mutate(returns = price / lag(price) - 1) %>%
    filter(!is.na(returns))
str(stock_tidy_tibble_returns)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  6295 obs. of  4 variables:
##  $ date   : Date, format: "2013-01-02" "2013-01-03" ...
##  $ symbol : chr  "AMZN" "AMZN" "AMZN" "AMZN" ...
##  $ price  : num  257 258 259 268 266 ...
##  $ returns: num  0.02567 0.00455 0.00259 0.03593 -0.00775 ...
##  - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':   5 obs. of  2 variables:
##   ..$ symbol: chr  "AMZN" "DIS" "GOOG" "JPM" ...
##   ..$ .rows :List of 5
##   .. ..$ : int  1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int  1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 ...
##   .. ..$ : int  2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 ...
##   .. ..$ : int  3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 ...
##   .. ..$ : int  5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 ...
##   ..- attr(*, ".drop")= logi TRUE
stock_tidy_tibble_returns %>%
    # Calculate the standard deviation and mean of returns
    summarize(std_dev = sd(returns), mean = mean(returns)) %>%
    hchart(., hcaes(x = symbol, y = std_dev, color = symbol, size = mean), type = "scatter") %>% 
    hc_title(text = "Standard Dev and Mean Return")
stock_tidy_tibble_returns %>%
    summarize(avg_returns = mean(returns), vol_risk = sd(returns), risk_return = vol_risk/avg_returns) %>%
    # Pass the summary statistics to hchart
    hchart(., hcaes(x = symbol, y = risk_return, group = symbol), type = "column") %>% 
    hc_title(text = "Risk/Return") %>% 
    hc_subtitle(text = "lower bars are better")
stock_tidy_tibble_prices %>%
    mutate(sector = case_when(symbol == "AMZN" ~ "tech", symbol == "GOOG" ~ "tech", symbol == "DIS" ~ "fun",
                              symbol == "JPM" ~ "bank", symbol == "KO" ~ "food")) %>%
    hchart(., hcaes(x = date, y = price, group = symbol), type = "line") %>%
    # Set the tooltip display with curly braces
    hc_tooltip(pointFormat = "{point.symbol}: ${point.y: .2f}<br> sector: {point.sector}")
# Calculate the mean, sd, max and min returns
stock_tidy_tibble_returns %>% 
    summarize(mean = mean(returns), st_dev = sd(returns), 
              max_return = max(returns), min_return = min(returns)
              ) %>%
    hchart(., hcaes(x = symbol, y = st_dev, group = symbol), type = "column") %>% 
    hc_tooltip(pointFormat = "mean: {point.mean: .4f}% <br>max: {point.max_return: .4f}% <br>min: {point.min_return: .4f}%")
# Pass the tidy tibble to hchart()
hchart(commodities_returns_tidy, hcaes(x = date, y = return, group = metal, date = date), type = "scatter") %>% 
    hc_title(text = "Gold, Palladium and Platinum Returns 2017") %>%
    # Customize the tooltip
    hc_tooltip(pointFormat = "date: {point.date} <br>{point.metal}: {point.return: .4f}")

Advanced Dimensionality Reduction in R

Chapter 1 - Introduction to Advanced Dimensionality Reduction

Exploring the MNIST Dataset:

  • t-SNE stands for t-Distributed Stochastic Neighbor Embedding
  • GLRM stands for Generalized Low Rank Model
  • Dimensionality reduction has multiple benefits - feature selection, data compression, visualization, etc.
  • The MNIST dataset contains 70,000 images of 28x28 pixel handwritten digits
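
A quick way to sanity-check the data is to render one digit. This is a minimal sketch in base R, assuming (as the example code below confirms) that mnist_sample stores a label column followed by 784 flattened pixel columns:

```r
# Sketch: render the first MNIST digit from its flattened pixel row
# (assumes a label column followed by 784 pixel columns, in row-major order)
pixels <- as.numeric(mnist_sample[1, -1])        # drop the label column
img    <- matrix(pixels, nrow = 28, ncol = 28)   # reshape to 28x28
# image() fills from the bottom-left, so flip the rows for a natural orientation
image(t(img[28:1, ]), col = grey.colors(255), axes = FALSE,
      main = paste("Label:", mnist_sample$label[1]))
```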

Distance Metrics:

  • Can use distance metrics to quantify similarity between MNIST digits
  • A distance metric is a function d for points x, y, z whose output satisfies all of:
    • Triangle inequality: d(x,z) <= d(x,y) + d(y,z)
    • Symmetric property: d(x,y) = d(y,x)
    • Non-negativity and identity: d(x,y) >= 0 and d(x,y)=0 only if x=y
  • Euclidean distance is an example - length of the connecting line segment (square root of sum-squared distances by dimension)
    • distances <- dist(mnist_sample[195:200, -1])
    • heatmap(as.matrix(distances), Rowv = NA, symm = T, labRow = mnist_sample$label[195:200], labCol = mnist_sample$label[195:200])
  • Minkowski family of distances
    • d(x, y) = ( sum-over-all-dimensions-i of |x_i - y_i|^p )^(1/p) # the Manhattan distance for p = 1, Euclidean for p = 2
    • distances <- dist(mnist_sample[195:200, -1], method = "minkowski", p = 3)
    • distances <- dist(mnist_sample[195:200, -1], method = "manhattan")
  • The Kullback-Leibler Divergence fails to meet all the criteria
    • Not a metric, since it satisfies neither the symmetric property nor the triangle inequality
    • Measures the difference between two probability distributions
    • A divergence of 0 indicates that the two distributions are identical
    • A common measure in Machine Learning (e.g., in t-SNE); in decision trees the related quantity is called Information Gain
    • library(philentropy)
    • mnist_6 <- mnist_sample[195:200, -1]
    • mnist_6 <- mnist_6 + 1 # avoid zeros before rescaling to probabilities
    • sums <- rowSums(mnist_6)
    • distances <- distance(mnist_6/sums, method = "kullback-leibler")
    • heatmap(as.matrix(distances), Rowv = NA, symm = T, labRow = mnist_sample$label[195:200], labCol = mnist_sample$label[195:200])
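
The metric axioms above can be checked numerically. This is a self-contained sketch using random 784-dimensional points and a hand-rolled Euclidean distance (equivalent to dist() with method = "euclidean"); each expression should evaluate to TRUE:

```r
# Sketch: numerically check the metric axioms for Euclidean distance
# on three random points x, y, z in 784 dimensions
set.seed(1)
x <- runif(784); y <- runif(784); z <- runif(784)
d <- function(a, b) sqrt(sum((a - b)^2))

d(x, y) >= 0                     # non-negativity
d(x, x) == 0                     # identity
isTRUE(all.equal(d(x, y), d(y, x)))      # symmetric property
d(x, z) <= d(x, y) + d(y, z)     # triangle inequality
```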

PCA and t-SNE:

  • The “curse of dimensionality” refers to distance metrics performing poorly on high-dimensional datasets
  • Principal Component Analysis (PCA) is a well-known technique for dimension reduction
    • pca_result <- prcomp(mnist[, -1])
    • pca_result <- prcomp(mnist[, -1], rank. = 2) # get only the first two components
    • plot(pca_result$x[, 1:2], pch = as.character(mnist$label), col = mnist$label, main = “PCA output”)
  • The t-SNE process (not shown) can sometimes better differentiate the underlying data
    • plot(tsne$tsne_x, tsne$tsne_y, pch = as.character(mnist$label), col = mnist$label + 1, main = “t-SNE output”)
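
Since the course does not show the t-SNE fit itself, here is a hedged sketch of one common approach, assuming the Rtsne package (an assumption — the course may use a different implementation) and the loaded mnist_sample data:

```r
# Sketch: 2-D t-SNE embedding via the Rtsne package (assumed, not from the course)
library(Rtsne)
tsne_out <- Rtsne(as.matrix(mnist_sample[, -1]), dims = 2,
                  perplexity = 30, check_duplicates = FALSE)
# tsne_out$Y holds the 2-D embedding, one row per input record
plot(tsne_out$Y, pch = as.character(mnist_sample$label),
     col = mnist_sample$label + 1, main = "t-SNE output")
```

Note that Rtsne() requires perplexity to be small relative to the sample size (roughly perplexity < (n - 1) / 3), which holds for the 200-record sample here.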

Example code includes:

load("./RInputFiles/mnist-sample-200.RData")
load("./RInputFiles/fashion_mnist_500.RData")
load("./RInputFiles/creditcard.RData")

dim(mnist_sample)
## [1] 200 785
dim(fashion_mnist)
## [1] 500 785
str(creditcard)
## Classes 'data.table' and 'data.frame':   28923 obs. of  31 variables:
##  $ Time  : num  406 472 4462 6986 7519 ...
##  $ V1    : num  -2.31 -3.04 -2.3 -4.4 1.23 ...
##  $ V2    : num  1.95 -3.16 1.76 1.36 3.02 ...
##  $ V3    : num  -1.61 1.09 -0.36 -2.59 -4.3 ...
##  $ V4    : num  4 2.29 2.33 2.68 4.73 ...
##  $ V5    : num  -0.522 1.36 -0.822 -1.128 3.624 ...
##  $ V6    : num  -1.4265 -1.0648 -0.0758 -1.7065 -1.3577 ...
##  $ V7    : num  -2.537 0.326 0.562 -3.496 1.713 ...
##  $ V8    : num  1.3917 -0.0678 -0.3991 -0.2488 -0.4964 ...
##  $ V9    : num  -2.77 -0.271 -0.238 -0.248 -1.283 ...
##  $ V10   : num  -2.772 -0.839 -1.525 -4.802 -2.447 ...
##  $ V11   : num  3.202 -0.415 2.033 4.896 2.101 ...
##  $ V12   : num  -2.9 -0.503 -6.56 -10.913 -4.61 ...
##  $ V13   : num  -0.5952 0.6765 0.0229 0.1844 1.4644 ...
##  $ V14   : num  -4.29 -1.69 -1.47 -6.77 -6.08 ...
##  $ V15   : num  0.38972 2.00063 -0.69883 -0.00733 -0.33924 ...
##  $ V16   : num  -1.141 0.667 -2.282 -7.358 2.582 ...
##  $ V17   : num  -2.83 0.6 -4.78 -12.6 6.74 ...
##  $ V18   : num  -0.0168 1.7253 -2.6157 -5.1315 3.0425 ...
##  $ V19   : num  0.417 0.283 -1.334 0.308 -2.722 ...
##  $ V20   : num  0.12691 2.10234 -0.43002 -0.17161 0.00906 ...
##  $ V21   : num  0.517 0.662 -0.294 0.574 -0.379 ...
##  $ V22   : num  -0.035 0.435 -0.932 0.177 -0.704 ...
##  $ V23   : num  -0.465 1.376 0.173 -0.436 -0.657 ...
##  $ V24   : num  0.3202 -0.2938 -0.0873 -0.0535 -1.6327 ...
##  $ V25   : num  0.0445 0.2798 -0.1561 0.2524 1.4889 ...
##  $ V26   : num  0.178 -0.145 -0.543 -0.657 0.567 ...
##  $ V27   : num  0.2611 -0.2528 0.0396 -0.8271 -0.01 ...
##  $ V28   : num  -0.1433 0.0358 -0.153 0.8496 0.1468 ...
##  $ Amount: num  0 529 240 59 1 ...
##  $ Class : chr  "1" "1" "1" "1" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Have a look at the MNIST dataset names
names(mnist_sample)
##   [1] "label"    "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"  
##   [7] "pixel5"   "pixel6"   "pixel7"   "pixel8"   "pixel9"   "pixel10" 
##  [13] "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15"  "pixel16" 
##  [19] "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22" 
##  [25] "pixel23"  "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28" 
##  [31] "pixel29"  "pixel30"  "pixel31"  "pixel32"  "pixel33"  "pixel34" 
##  [37] "pixel35"  "pixel36"  "pixel37"  "pixel38"  "pixel39"  "pixel40" 
##  [43] "pixel41"  "pixel42"  "pixel43"  "pixel44"  "pixel45"  "pixel46" 
##  [49] "pixel47"  "pixel48"  "pixel49"  "pixel50"  "pixel51"  "pixel52" 
##  [55] "pixel53"  "pixel54"  "pixel55"  "pixel56"  "pixel57"  "pixel58" 
##  [61] "pixel59"  "pixel60"  "pixel61"  "pixel62"  "pixel63"  "pixel64" 
##  [67] "pixel65"  "pixel66"  "pixel67"  "pixel68"  "pixel69"  "pixel70" 
##  [73] "pixel71"  "pixel72"  "pixel73"  "pixel74"  "pixel75"  "pixel76" 
##  [79] "pixel77"  "pixel78"  "pixel79"  "pixel80"  "pixel81"  "pixel82" 
##  [85] "pixel83"  "pixel84"  "pixel85"  "pixel86"  "pixel87"  "pixel88" 
##  [91] "pixel89"  "pixel90"  "pixel91"  "pixel92"  "pixel93"  "pixel94" 
##  [97] "pixel95"  "pixel96"  "pixel97"  "pixel98"  "pixel99"  "pixel100"
## [103] "pixel101" "pixel102" "pixel103" "pixel104" "pixel105" "pixel106"
## [109] "pixel107" "pixel108" "pixel109" "pixel110" "pixel111" "pixel112"
## [115] "pixel113" "pixel114" "pixel115" "pixel116" "pixel117" "pixel118"
## [121] "pixel119" "pixel120" "pixel121" "pixel122" "pixel123" "pixel124"
## [127] "pixel125" "pixel126" "pixel127" "pixel128" "pixel129" "pixel130"
## [133] "pixel131" "pixel132" "pixel133" "pixel134" "pixel135" "pixel136"
## [139] "pixel137" "pixel138" "pixel139" "pixel140" "pixel141" "pixel142"
## [145] "pixel143" "pixel144" "pixel145" "pixel146" "pixel147" "pixel148"
## [151] "pixel149" "pixel150" "pixel151" "pixel152" "pixel153" "pixel154"
## [157] "pixel155" "pixel156" "pixel157" "pixel158" "pixel159" "pixel160"
## [163] "pixel161" "pixel162" "pixel163" "pixel164" "pixel165" "pixel166"
## [169] "pixel167" "pixel168" "pixel169" "pixel170" "pixel171" "pixel172"
## [175] "pixel173" "pixel174" "pixel175" "pixel176" "pixel177" "pixel178"
## [181] "pixel179" "pixel180" "pixel181" "pixel182" "pixel183" "pixel184"
## [187] "pixel185" "pixel186" "pixel187" "pixel188" "pixel189" "pixel190"
## [193] "pixel191" "pixel192" "pixel193" "pixel194" "pixel195" "pixel196"
## [199] "pixel197" "pixel198" "pixel199" "pixel200" "pixel201" "pixel202"
## [205] "pixel203" "pixel204" "pixel205" "pixel206" "pixel207" "pixel208"
## [211] "pixel209" "pixel210" "pixel211" "pixel212" "pixel213" "pixel214"
## [217] "pixel215" "pixel216" "pixel217" "pixel218" "pixel219" "pixel220"
## [223] "pixel221" "pixel222" "pixel223" "pixel224" "pixel225" "pixel226"
## [229] "pixel227" "pixel228" "pixel229" "pixel230" "pixel231" "pixel232"
## [235] "pixel233" "pixel234" "pixel235" "pixel236" "pixel237" "pixel238"
## [241] "pixel239" "pixel240" "pixel241" "pixel242" "pixel243" "pixel244"
## [247] "pixel245" "pixel246" "pixel247" "pixel248" "pixel249" "pixel250"
## [253] "pixel251" "pixel252" "pixel253" "pixel254" "pixel255" "pixel256"
## [259] "pixel257" "pixel258" "pixel259" "pixel260" "pixel261" "pixel262"
## [265] "pixel263" "pixel264" "pixel265" "pixel266" "pixel267" "pixel268"
## [271] "pixel269" "pixel270" "pixel271" "pixel272" "pixel273" "pixel274"
## [277] "pixel275" "pixel276" "pixel277" "pixel278" "pixel279" "pixel280"
## [283] "pixel281" "pixel282" "pixel283" "pixel284" "pixel285" "pixel286"
## [289] "pixel287" "pixel288" "pixel289" "pixel290" "pixel291" "pixel292"
## [295] "pixel293" "pixel294" "pixel295" "pixel296" "pixel297" "pixel298"
## [301] "pixel299" "pixel300" "pixel301" "pixel302" "pixel303" "pixel304"
## [307] "pixel305" "pixel306" "pixel307" "pixel308" "pixel309" "pixel310"
## [313] "pixel311" "pixel312" "pixel313" "pixel314" "pixel315" "pixel316"
## [319] "pixel317" "pixel318" "pixel319" "pixel320" "pixel321" "pixel322"
## [325] "pixel323" "pixel324" "pixel325" "pixel326" "pixel327" "pixel328"
## [331] "pixel329" "pixel330" "pixel331" "pixel332" "pixel333" "pixel334"
## [337] "pixel335" "pixel336" "pixel337" "pixel338" "pixel339" "pixel340"
## [343] "pixel341" "pixel342" "pixel343" "pixel344" "pixel345" "pixel346"
## [349] "pixel347" "pixel348" "pixel349" "pixel350" "pixel351" "pixel352"
## [355] "pixel353" "pixel354" "pixel355" "pixel356" "pixel357" "pixel358"
## [361] "pixel359" "pixel360" "pixel361" "pixel362" "pixel363" "pixel364"
## [367] "pixel365" "pixel366" "pixel367" "pixel368" "pixel369" "pixel370"
## [373] "pixel371" "pixel372" "pixel373" "pixel374" "pixel375" "pixel376"
## [379] "pixel377" "pixel378" "pixel379" "pixel380" "pixel381" "pixel382"
## [385] "pixel383" "pixel384" "pixel385" "pixel386" "pixel387" "pixel388"
## [391] "pixel389" "pixel390" "pixel391" "pixel392" "pixel393" "pixel394"
## [397] "pixel395" "pixel396" "pixel397" "pixel398" "pixel399" "pixel400"
## [403] "pixel401" "pixel402" "pixel403" "pixel404" "pixel405" "pixel406"
## [409] "pixel407" "pixel408" "pixel409" "pixel410" "pixel411" "pixel412"
## [415] "pixel413" "pixel414" "pixel415" "pixel416" "pixel417" "pixel418"
## [421] "pixel419" "pixel420" "pixel421" "pixel422" "pixel423" "pixel424"
## [427] "pixel425" "pixel426" "pixel427" "pixel428" "pixel429" "pixel430"
## [433] "pixel431" "pixel432" "pixel433" "pixel434" "pixel435" "pixel436"
## [439] "pixel437" "pixel438" "pixel439" "pixel440" "pixel441" "pixel442"
## [445] "pixel443" "pixel444" "pixel445" "pixel446" "pixel447" "pixel448"
## [451] "pixel449" "pixel450" "pixel451" "pixel452" "pixel453" "pixel454"
## [457] "pixel455" "pixel456" "pixel457" "pixel458" "pixel459" "pixel460"
## [463] "pixel461" "pixel462" "pixel463" "pixel464" "pixel465" "pixel466"
## [469] "pixel467" "pixel468" "pixel469" "pixel470" "pixel471" "pixel472"
## [475] "pixel473" "pixel474" "pixel475" "pixel476" "pixel477" "pixel478"
## [481] "pixel479" "pixel480" "pixel481" "pixel482" "pixel483" "pixel484"
## [487] "pixel485" "pixel486" "pixel487" "pixel488" "pixel489" "pixel490"
## [493] "pixel491" "pixel492" "pixel493" "pixel494" "pixel495" "pixel496"
## [499] "pixel497" "pixel498" "pixel499" "pixel500" "pixel501" "pixel502"
## [505] "pixel503" "pixel504" "pixel505" "pixel506" "pixel507" "pixel508"
## [511] "pixel509" "pixel510" "pixel511" "pixel512" "pixel513" "pixel514"
## [517] "pixel515" "pixel516" "pixel517" "pixel518" "pixel519" "pixel520"
## [523] "pixel521" "pixel522" "pixel523" "pixel524" "pixel525" "pixel526"
## [529] "pixel527" "pixel528" "pixel529" "pixel530" "pixel531" "pixel532"
## [535] "pixel533" "pixel534" "pixel535" "pixel536" "pixel537" "pixel538"
## [541] "pixel539" "pixel540" "pixel541" "pixel542" "pixel543" "pixel544"
## [547] "pixel545" "pixel546" "pixel547" "pixel548" "pixel549" "pixel550"
## [553] "pixel551" "pixel552" "pixel553" "pixel554" "pixel555" "pixel556"
## [559] "pixel557" "pixel558" "pixel559" "pixel560" "pixel561" "pixel562"
## [565] "pixel563" "pixel564" "pixel565" "pixel566" "pixel567" "pixel568"
## [571] "pixel569" "pixel570" "pixel571" "pixel572" "pixel573" "pixel574"
## [577] "pixel575" "pixel576" "pixel577" "pixel578" "pixel579" "pixel580"
## [583] "pixel581" "pixel582" "pixel583" "pixel584" "pixel585" "pixel586"
## [589] "pixel587" "pixel588" "pixel589" "pixel590" "pixel591" "pixel592"
## [595] "pixel593" "pixel594" "pixel595" "pixel596" "pixel597" "pixel598"
## [601] "pixel599" "pixel600" "pixel601" "pixel602" "pixel603" "pixel604"
## [607] "pixel605" "pixel606" "pixel607" "pixel608" "pixel609" "pixel610"
## [613] "pixel611" "pixel612" "pixel613" "pixel614" "pixel615" "pixel616"
## [619] "pixel617" "pixel618" "pixel619" "pixel620" "pixel621" "pixel622"
## [625] "pixel623" "pixel624" "pixel625" "pixel626" "pixel627" "pixel628"
## [631] "pixel629" "pixel630" "pixel631" "pixel632" "pixel633" "pixel634"
## [637] "pixel635" "pixel636" "pixel637" "pixel638" "pixel639" "pixel640"
## [643] "pixel641" "pixel642" "pixel643" "pixel644" "pixel645" "pixel646"
## [649] "pixel647" "pixel648" "pixel649" "pixel650" "pixel651" "pixel652"
## [655] "pixel653" "pixel654" "pixel655" "pixel656" "pixel657" "pixel658"
## [661] "pixel659" "pixel660" "pixel661" "pixel662" "pixel663" "pixel664"
## [667] "pixel665" "pixel666" "pixel667" "pixel668" "pixel669" "pixel670"
## [673] "pixel671" "pixel672" "pixel673" "pixel674" "pixel675" "pixel676"
## [679] "pixel677" "pixel678" "pixel679" "pixel680" "pixel681" "pixel682"
## [685] "pixel683" "pixel684" "pixel685" "pixel686" "pixel687" "pixel688"
## [691] "pixel689" "pixel690" "pixel691" "pixel692" "pixel693" "pixel694"
## [697] "pixel695" "pixel696" "pixel697" "pixel698" "pixel699" "pixel700"
## [703] "pixel701" "pixel702" "pixel703" "pixel704" "pixel705" "pixel706"
## [709] "pixel707" "pixel708" "pixel709" "pixel710" "pixel711" "pixel712"
## [715] "pixel713" "pixel714" "pixel715" "pixel716" "pixel717" "pixel718"
## [721] "pixel719" "pixel720" "pixel721" "pixel722" "pixel723" "pixel724"
## [727] "pixel725" "pixel726" "pixel727" "pixel728" "pixel729" "pixel730"
## [733] "pixel731" "pixel732" "pixel733" "pixel734" "pixel735" "pixel736"
## [739] "pixel737" "pixel738" "pixel739" "pixel740" "pixel741" "pixel742"
## [745] "pixel743" "pixel744" "pixel745" "pixel746" "pixel747" "pixel748"
## [751] "pixel749" "pixel750" "pixel751" "pixel752" "pixel753" "pixel754"
## [757] "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
## [763] "pixel761" "pixel762" "pixel763" "pixel764" "pixel765" "pixel766"
## [769] "pixel767" "pixel768" "pixel769" "pixel770" "pixel771" "pixel772"
## [775] "pixel773" "pixel774" "pixel775" "pixel776" "pixel777" "pixel778"
## [781] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"
names(fashion_mnist)
##   [1] "label"    "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"  
##   [7] "pixel6"   "pixel7"   "pixel8"   "pixel9"   "pixel10"  "pixel11" 
##  [13] "pixel12"  "pixel13"  "pixel14"  "pixel15"  "pixel16"  "pixel17" 
##  [19] "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
##  [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29" 
##  [31] "pixel30"  "pixel31"  "pixel32"  "pixel33"  "pixel34"  "pixel35" 
##  [37] "pixel36"  "pixel37"  "pixel38"  "pixel39"  "pixel40"  "pixel41" 
##  [43] "pixel42"  "pixel43"  "pixel44"  "pixel45"  "pixel46"  "pixel47" 
##  [49] "pixel48"  "pixel49"  "pixel50"  "pixel51"  "pixel52"  "pixel53" 
##  [55] "pixel54"  "pixel55"  "pixel56"  "pixel57"  "pixel58"  "pixel59" 
##  [61] "pixel60"  "pixel61"  "pixel62"  "pixel63"  "pixel64"  "pixel65" 
##  [67] "pixel66"  "pixel67"  "pixel68"  "pixel69"  "pixel70"  "pixel71" 
##  [73] "pixel72"  "pixel73"  "pixel74"  "pixel75"  "pixel76"  "pixel77" 
##  [79] "pixel78"  "pixel79"  "pixel80"  "pixel81"  "pixel82"  "pixel83" 
##  [85] "pixel84"  "pixel85"  "pixel86"  "pixel87"  "pixel88"  "pixel89" 
##  [91] "pixel90"  "pixel91"  "pixel92"  "pixel93"  "pixel94"  "pixel95" 
##  [97] "pixel96"  "pixel97"  "pixel98"  "pixel99"  "pixel100" "pixel101"
## [103] "pixel102" "pixel103" "pixel104" "pixel105" "pixel106" "pixel107"
## [109] "pixel108" "pixel109" "pixel110" "pixel111" "pixel112" "pixel113"
## [115] "pixel114" "pixel115" "pixel116" "pixel117" "pixel118" "pixel119"
## [121] "pixel120" "pixel121" "pixel122" "pixel123" "pixel124" "pixel125"
## [127] "pixel126" "pixel127" "pixel128" "pixel129" "pixel130" "pixel131"
## [133] "pixel132" "pixel133" "pixel134" "pixel135" "pixel136" "pixel137"
## [139] "pixel138" "pixel139" "pixel140" "pixel141" "pixel142" "pixel143"
## [145] "pixel144" "pixel145" "pixel146" "pixel147" "pixel148" "pixel149"
## [151] "pixel150" "pixel151" "pixel152" "pixel153" "pixel154" "pixel155"
## [157] "pixel156" "pixel157" "pixel158" "pixel159" "pixel160" "pixel161"
## [163] "pixel162" "pixel163" "pixel164" "pixel165" "pixel166" "pixel167"
## [169] "pixel168" "pixel169" "pixel170" "pixel171" "pixel172" "pixel173"
## [175] "pixel174" "pixel175" "pixel176" "pixel177" "pixel178" "pixel179"
## [181] "pixel180" "pixel181" "pixel182" "pixel183" "pixel184" "pixel185"
## [187] "pixel186" "pixel187" "pixel188" "pixel189" "pixel190" "pixel191"
## [193] "pixel192" "pixel193" "pixel194" "pixel195" "pixel196" "pixel197"
## [199] "pixel198" "pixel199" "pixel200" "pixel201" "pixel202" "pixel203"
## [205] "pixel204" "pixel205" "pixel206" "pixel207" "pixel208" "pixel209"
## [211] "pixel210" "pixel211" "pixel212" "pixel213" "pixel214" "pixel215"
## [217] "pixel216" "pixel217" "pixel218" "pixel219" "pixel220" "pixel221"
## [223] "pixel222" "pixel223" "pixel224" "pixel225" "pixel226" "pixel227"
## [229] "pixel228" "pixel229" "pixel230" "pixel231" "pixel232" "pixel233"
## [235] "pixel234" "pixel235" "pixel236" "pixel237" "pixel238" "pixel239"
## [241] "pixel240" "pixel241" "pixel242" "pixel243" "pixel244" "pixel245"
## [247] "pixel246" "pixel247" "pixel248" "pixel249" "pixel250" "pixel251"
## [253] "pixel252" "pixel253" "pixel254" "pixel255" "pixel256" "pixel257"
## [259] "pixel258" "pixel259" "pixel260" "pixel261" "pixel262" "pixel263"
## [265] "pixel264" "pixel265" "pixel266" "pixel267" "pixel268" "pixel269"
## [271] "pixel270" "pixel271" "pixel272" "pixel273" "pixel274" "pixel275"
## [277] "pixel276" "pixel277" "pixel278" "pixel279" "pixel280" "pixel281"
## [283] "pixel282" "pixel283" "pixel284" "pixel285" "pixel286" "pixel287"
## [289] "pixel288" "pixel289" "pixel290" "pixel291" "pixel292" "pixel293"
## [295] "pixel294" "pixel295" "pixel296" "pixel297" "pixel298" "pixel299"
## [301] "pixel300" "pixel301" "pixel302" "pixel303" "pixel304" "pixel305"
## [307] "pixel306" "pixel307" "pixel308" "pixel309" "pixel310" "pixel311"
## [313] "pixel312" "pixel313" "pixel314" "pixel315" "pixel316" "pixel317"
## [319] "pixel318" "pixel319" "pixel320" "pixel321" "pixel322" "pixel323"
## [325] "pixel324" "pixel325" "pixel326" "pixel327" "pixel328" "pixel329"
## [331] "pixel330" "pixel331" "pixel332" "pixel333" "pixel334" "pixel335"
## [337] "pixel336" "pixel337" "pixel338" "pixel339" "pixel340" "pixel341"
## [343] "pixel342" "pixel343" "pixel344" "pixel345" "pixel346" "pixel347"
## [349] "pixel348" "pixel349" "pixel350" "pixel351" "pixel352" "pixel353"
## [355] "pixel354" "pixel355" "pixel356" "pixel357" "pixel358" "pixel359"
## [361] "pixel360" "pixel361" "pixel362" "pixel363" "pixel364" "pixel365"
## [367] "pixel366" "pixel367" "pixel368" "pixel369" "pixel370" "pixel371"
## [373] "pixel372" "pixel373" "pixel374" "pixel375" "pixel376" "pixel377"
## [379] "pixel378" "pixel379" "pixel380" "pixel381" "pixel382" "pixel383"
## [385] "pixel384" "pixel385" "pixel386" "pixel387" "pixel388" "pixel389"
## [391] "pixel390" "pixel391" "pixel392" "pixel393" "pixel394" "pixel395"
## [397] "pixel396" "pixel397" "pixel398" "pixel399" "pixel400" "pixel401"
## [403] "pixel402" "pixel403" "pixel404" "pixel405" "pixel406" "pixel407"
## [409] "pixel408" "pixel409" "pixel410" "pixel411" "pixel412" "pixel413"
## [415] "pixel414" "pixel415" "pixel416" "pixel417" "pixel418" "pixel419"
## [421] "pixel420" "pixel421" "pixel422" "pixel423" "pixel424" "pixel425"
## [427] "pixel426" "pixel427" "pixel428" "pixel429" "pixel430" "pixel431"
## [433] "pixel432" "pixel433" "pixel434" "pixel435" "pixel436" "pixel437"
## [439] "pixel438" "pixel439" "pixel440" "pixel441" "pixel442" "pixel443"
## [445] "pixel444" "pixel445" "pixel446" "pixel447" "pixel448" "pixel449"
## [451] "pixel450" "pixel451" "pixel452" "pixel453" "pixel454" "pixel455"
## [457] "pixel456" "pixel457" "pixel458" "pixel459" "pixel460" "pixel461"
## [463] "pixel462" "pixel463" "pixel464" "pixel465" "pixel466" "pixel467"
## [469] "pixel468" "pixel469" "pixel470" "pixel471" "pixel472" "pixel473"
## [475] "pixel474" "pixel475" "pixel476" "pixel477" "pixel478" "pixel479"
## [481] "pixel480" "pixel481" "pixel482" "pixel483" "pixel484" "pixel485"
## [487] "pixel486" "pixel487" "pixel488" "pixel489" "pixel490" "pixel491"
## [493] "pixel492" "pixel493" "pixel494" "pixel495" "pixel496" "pixel497"
## [499] "pixel498" "pixel499" "pixel500" "pixel501" "pixel502" "pixel503"
## [505] "pixel504" "pixel505" "pixel506" "pixel507" "pixel508" "pixel509"
## [511] "pixel510" "pixel511" "pixel512" "pixel513" "pixel514" "pixel515"
## [517] "pixel516" "pixel517" "pixel518" "pixel519" "pixel520" "pixel521"
## [523] "pixel522" "pixel523" "pixel524" "pixel525" "pixel526" "pixel527"
## [529] "pixel528" "pixel529" "pixel530" "pixel531" "pixel532" "pixel533"
## [535] "pixel534" "pixel535" "pixel536" "pixel537" "pixel538" "pixel539"
## [541] "pixel540" "pixel541" "pixel542" "pixel543" "pixel544" "pixel545"
## [547] "pixel546" "pixel547" "pixel548" "pixel549" "pixel550" "pixel551"
## [553] "pixel552" "pixel553" "pixel554" "pixel555" "pixel556" "pixel557"
## [559] "pixel558" "pixel559" "pixel560" "pixel561" "pixel562" "pixel563"
## [565] "pixel564" "pixel565" "pixel566" "pixel567" "pixel568" "pixel569"
## [571] "pixel570" "pixel571" "pixel572" "pixel573" "pixel574" "pixel575"
## [577] "pixel576" "pixel577" "pixel578" "pixel579" "pixel580" "pixel581"
## [583] "pixel582" "pixel583" "pixel584" "pixel585" "pixel586" "pixel587"
## [589] "pixel588" "pixel589" "pixel590" "pixel591" "pixel592" "pixel593"
## [595] "pixel594" "pixel595" "pixel596" "pixel597" "pixel598" "pixel599"
## [601] "pixel600" "pixel601" "pixel602" "pixel603" "pixel604" "pixel605"
## [607] "pixel606" "pixel607" "pixel608" "pixel609" "pixel610" "pixel611"
## [613] "pixel612" "pixel613" "pixel614" "pixel615" "pixel616" "pixel617"
## [619] "pixel618" "pixel619" "pixel620" "pixel621" "pixel622" "pixel623"
## [625] "pixel624" "pixel625" "pixel626" "pixel627" "pixel628" "pixel629"
## [631] "pixel630" "pixel631" "pixel632" "pixel633" "pixel634" "pixel635"
## [637] "pixel636" "pixel637" "pixel638" "pixel639" "pixel640" "pixel641"
## [643] "pixel642" "pixel643" "pixel644" "pixel645" "pixel646" "pixel647"
## [649] "pixel648" "pixel649" "pixel650" "pixel651" "pixel652" "pixel653"
## [655] "pixel654" "pixel655" "pixel656" "pixel657" "pixel658" "pixel659"
## [661] "pixel660" "pixel661" "pixel662" "pixel663" "pixel664" "pixel665"
## [667] "pixel666" "pixel667" "pixel668" "pixel669" "pixel670" "pixel671"
## [673] "pixel672" "pixel673" "pixel674" "pixel675" "pixel676" "pixel677"
## [679] "pixel678" "pixel679" "pixel680" "pixel681" "pixel682" "pixel683"
## [685] "pixel684" "pixel685" "pixel686" "pixel687" "pixel688" "pixel689"
## [691] "pixel690" "pixel691" "pixel692" "pixel693" "pixel694" "pixel695"
## [697] "pixel696" "pixel697" "pixel698" "pixel699" "pixel700" "pixel701"
## [703] "pixel702" "pixel703" "pixel704" "pixel705" "pixel706" "pixel707"
## [709] "pixel708" "pixel709" "pixel710" "pixel711" "pixel712" "pixel713"
## [715] "pixel714" "pixel715" "pixel716" "pixel717" "pixel718" "pixel719"
## [721] "pixel720" "pixel721" "pixel722" "pixel723" "pixel724" "pixel725"
## [727] "pixel726" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731"
## [733] "pixel732" "pixel733" "pixel734" "pixel735" "pixel736" "pixel737"
## [739] "pixel738" "pixel739" "pixel740" "pixel741" "pixel742" "pixel743"
## [745] "pixel744" "pixel745" "pixel746" "pixel747" "pixel748" "pixel749"
## [751] "pixel750" "pixel751" "pixel752" "pixel753" "pixel754" "pixel755"
## [757] "pixel756" "pixel757" "pixel758" "pixel759" "pixel760" "pixel761"
## [763] "pixel762" "pixel763" "pixel764" "pixel765" "pixel766" "pixel767"
## [769] "pixel768" "pixel769" "pixel770" "pixel771" "pixel772" "pixel773"
## [775] "pixel774" "pixel775" "pixel776" "pixel777" "pixel778" "pixel779"
## [781] "pixel780" "pixel781" "pixel782" "pixel783" "pixel784"
# Labels of the first 6 digits
head(mnist_sample[, 1])
## [1] 5 0 7 0 9 3
# Plot the histogram of the digit labels
hist(mnist_sample$label)

# Compute the basic statistics of all records
# summary(mnist_sample)

# Compute the basic statistics of digits with label 0
# summary(mnist_sample[mnist_sample$label==0,])


# Show the labels of the first 10 records
mnist_sample$label[1:10]
##  [1] 5 0 7 0 9 3 4 1 2 6
# Compute the Euclidean distance of the first 10 records
distances <- dist(mnist_sample[1:10, -1], method="euclidean")

# Show the distances values
distances
##       1    2    3    4    5    6    7    8    9
## 2  2186                                        
## 3  2656 2870                                   
## 4  2547 2341 2937                              
## 5  2407 2959 1976 2871                         
## 6  2344 2760 2453 2739 2126                    
## 7  2464 2784 2574 2871 2174 2654               
## 8  2150 2669 2000 2586 2067 2273 2408          
## 9  2959 3210 2935 3414 2871 3115 2981 2833     
## 10 2729 3010 2575 2833 2396 2656 2464 2550 2695
# Plot the numeric matrix of the distances in a heatmap
heatmap(as.matrix(distances), Rowv = NA, symm = TRUE, 
        labRow = mnist_sample$label[1:10], labCol = mnist_sample$label[1:10]
        )

# Minkowski distance of order 3
distances_3 <- dist(mnist_sample[1:10, -1], method="minkowski", p=3)
distances_3
##       1    2    3    4    5    6    7    8    9
## 2  1003                                        
## 3  1170 1229                                   
## 4  1127 1045 1250                              
## 5  1091 1260  941 1232                         
## 6  1064 1194 1104 1190  996                    
## 7  1098 1199 1131 1228 1006 1165               
## 8  1007 1169  951 1143  981 1056 1083          
## 9  1270 1337 1257 1401 1248 1319 1272 1237     
## 10 1187 1268 1134 1219 1085 1167 1096 1133 1181
heatmap(as.matrix(distances_3), Rowv = NA, symm = TRUE, 
        labRow = mnist_sample$label[1:10], labCol = mnist_sample$label[1:10]
        )

# Minkowski distance of order 2
distances_2 <- dist(mnist_sample[1:10, -1], method="minkowski", p=2)
distances_2
##       1    2    3    4    5    6    7    8    9
## 2  2186                                        
## 3  2656 2870                                   
## 4  2547 2341 2937                              
## 5  2407 2959 1976 2871                         
## 6  2344 2760 2453 2739 2126                    
## 7  2464 2784 2574 2871 2174 2654               
## 8  2150 2669 2000 2586 2067 2273 2408          
## 9  2959 3210 2935 3414 2871 3115 2981 2833     
## 10 2729 3010 2575 2833 2396 2656 2464 2550 2695
heatmap(as.matrix(distances_2), Rowv = NA, symm = TRUE, 
        labRow = mnist_sample$label[1:10], labCol = mnist_sample$label[1:10]
        )

# Get the first 10 records
mnist_10 <- mnist_sample[1:10, -1]

# Add 1 to avoid NaN when rescaling
mnist_10_prep <- mnist_10 + 1

# Compute the sums per row
sums <- rowSums(mnist_10_prep)

# Compute KL divergence
distances <- philentropy::distance(mnist_10_prep/sums, method="kullback-leibler")
## Metric: 'kullback-leibler' using unit: 'log'; comparing: 10 vectors.
heatmap(as.matrix(distances), Rowv = NA, symm = TRUE, 
        labRow = mnist_sample$label[1:10], labCol = mnist_sample$label[1:10]
        )

# Get the principal components from PCA
pca_output <- prcomp(mnist_sample[, -1])

# Observe a summary of the output
summary(pca_output)
## Importance of components:
##                             PC1      PC2      PC3      PC4     PC5
## Standard deviation     581.0793 511.1573 499.0990 440.1563 438.994
## Proportion of Variance   0.0964   0.0746   0.0712   0.0553   0.055
## Cumulative Proportion    0.0964   0.1711   0.2422   0.2975   0.353
##                             PC6      PC7     PC8     PC9     PC10     PC11
## Standard deviation     381.6497 360.3161 349.933 307.246 298.6138 286.2939
## Proportion of Variance   0.0416   0.0371   0.035   0.027   0.0255   0.0234
## Cumulative Proportion    0.3942   0.4313   0.466   0.493   0.5187   0.5421
##                            PC12     PC13     PC14    PC15     PC16
## Standard deviation     272.0886 256.8551 254.4171 251.229 233.2123
## Proportion of Variance   0.0211   0.0188   0.0185   0.018   0.0155
## Cumulative Proportion    0.5632   0.5821   0.6006   0.619   0.6341
##                            PC17    PC18     PC19     PC20     PC21
## Standard deviation     230.0690 221.486 217.2549 210.0747 200.5780
## Proportion of Variance   0.0151   0.014   0.0135   0.0126   0.0115
## Cumulative Proportion    0.6492   0.663   0.6767   0.6894   0.7008
##                            PC22     PC23     PC24     PC25     PC26
## Standard deviation     198.6126 193.2674 1.87e+02 180.4072 1.74e+02
## Proportion of Variance   0.0113   0.0107 9.96e-03   0.0093 8.67e-03
## Cumulative Proportion    0.7121   0.7228 7.33e-01   0.7420 7.51e-01
##                            PC27     PC28     PC29     PC30     PC31
## Standard deviation     1.73e+02 1.70e+02 1.66e+02 1.64e+02 1.62e+02
## Proportion of Variance 8.56e-03 8.21e-03 7.91e-03 7.66e-03 7.49e-03
## Cumulative Proportion  7.59e-01 7.67e-01 7.75e-01 7.83e-01 7.91e-01
##                            PC32     PC33     PC34     PC35     PC36
## Standard deviation     1.59e+02 1.54e+02 1.47e+02 1.45e+02 1.43e+02
## Proportion of Variance 7.21e-03 6.79e-03 6.18e-03 6.01e-03 5.81e-03
## Cumulative Proportion  7.98e-01 8.05e-01 8.11e-01 8.17e-01 8.23e-01
##                            PC37     PC38     PC39     PC40     PC41
## Standard deviation     1.41e+02 1.36e+02 1.32e+02 1.32e+02 1.26e+02
## Proportion of Variance 5.64e-03 5.26e-03 4.99e-03 4.96e-03 4.56e-03
## Cumulative Proportion  8.28e-01 8.33e-01 8.38e-01 8.43e-01 8.48e-01
##                            PC42     PC43     PC44     PC45     PC46
## Standard deviation     1.23e+02 1.21e+02 1.20e+02 1.19e+02 1.16e+02
## Proportion of Variance 4.35e-03 4.16e-03 4.14e-03 4.04e-03 3.82e-03
## Cumulative Proportion  8.52e-01 8.56e-01 8.61e-01 8.65e-01 8.68e-01
##                            PC47     PC48     PC49     PC50     PC51
## Standard deviation     1.13e+02 1.12e+02 1.11e+02 1.08e+02 1.07e+02
## Proportion of Variance 3.64e-03 3.55e-03 3.49e-03 3.36e-03 3.27e-03
## Cumulative Proportion  8.72e-01 8.76e-01 8.79e-01 8.82e-01 8.86e-01
##                            PC52     PC53     PC54     PC55     PC56
## Standard deviation     1.04e+02 1.03e+02 1.01e+02 98.41844 96.19388
## Proportion of Variance 3.12e-03 3.02e-03 2.89e-03  0.00277  0.00264
## Cumulative Proportion  8.89e-01 8.92e-01 8.95e-01  0.89755  0.90019
##                            PC57     PC58     PC59     PC60     PC61
## Standard deviation     96.03764 93.24621 92.09529 91.13565 90.33912
## Proportion of Variance  0.00263  0.00248  0.00242  0.00237  0.00233
## Cumulative Proportion   0.90283  0.90531  0.90773  0.91010  0.91244
##                            PC62    PC63     PC64     PC65     PC66
## Standard deviation     88.43374 87.8278 86.50593 86.13606 84.45834
## Proportion of Variance  0.00223  0.0022  0.00214  0.00212  0.00204
## Cumulative Proportion   0.91467  0.9169  0.91901  0.92113  0.92317
##                            PC67     PC68     PC69     PC70     PC71
## Standard deviation     83.33234 81.00884 80.23066 78.73474 77.59712
## Proportion of Variance  0.00198  0.00187  0.00184  0.00177  0.00172
## Cumulative Proportion   0.92515  0.92702  0.92886  0.93063  0.93235
##                           PC72     PC73     PC74     PC75    PC76     PC77
## Standard deviation     77.0849 76.17273 75.46779 75.10109 72.5529 71.73195
## Proportion of Variance  0.0017  0.00166  0.00163  0.00161  0.0015  0.00147
## Cumulative Proportion   0.9341  0.93571  0.93733  0.93894  0.9405  0.94192
##                            PC78    PC79     PC80     PC81    PC82     PC83
## Standard deviation     71.33942 69.9079 69.10822 68.50964 67.4226 67.22979
## Proportion of Variance  0.00145  0.0014  0.00136  0.00134  0.0013  0.00129
## Cumulative Proportion   0.94337  0.9448  0.94613  0.94747  0.9488  0.95006
##                            PC84    PC85     PC86     PC87     PC88    PC89
## Standard deviation     65.89167 64.8959 64.62287 63.33802 62.88913 61.9592
## Proportion of Variance  0.00124  0.0012  0.00119  0.00115  0.00113  0.0011
## Cumulative Proportion   0.95130  0.9525  0.95370  0.95484  0.95597  0.9571
##                            PC90     PC91     PC92     PC93   PC94     PC95
## Standard deviation     61.34202 61.21219 60.39063 59.89925 59.261 58.16600
## Proportion of Variance  0.00107  0.00107  0.00104  0.00102  0.001  0.00097
## Cumulative Proportion   0.95814  0.95921  0.96026  0.96128  0.962  0.96325
##                            PC96    PC97     PC98     PC99    PC100
## Standard deviation     57.24925 56.2413 55.55244 55.24024 54.05327
## Proportion of Variance  0.00094  0.0009  0.00088  0.00087  0.00083
## Cumulative Proportion   0.96419  0.9651  0.96597  0.96684  0.96768
##                           PC101   PC102    PC103    PC104    PC105
## Standard deviation     53.72387 53.0814 52.66387 52.23018 51.83450
## Proportion of Variance  0.00082  0.0008  0.00079  0.00078  0.00077
## Cumulative Proportion   0.96850  0.9693  0.97010  0.97088  0.97164
##                           PC106    PC107    PC108   PC109    PC110
## Standard deviation     51.12512 50.70059 50.02274 49.5971 49.21602
## Proportion of Variance  0.00075  0.00073  0.00071  0.0007  0.00069
## Cumulative Proportion   0.97239  0.97313  0.97384  0.9745  0.97523
##                           PC111    PC112    PC113    PC114   PC115   PC116
## Standard deviation     49.07307 47.75766 47.10337 46.87950 45.8291 45.6506
## Proportion of Variance  0.00069  0.00065  0.00063  0.00063  0.0006  0.0006
## Cumulative Proportion   0.97592  0.97657  0.97721  0.97784  0.9784  0.9790
##                           PC117    PC118    PC119    PC120    PC121
## Standard deviation     45.32725 44.80365 44.55733 43.60770 42.85722
## Proportion of Variance  0.00059  0.00057  0.00057  0.00054  0.00052
## Cumulative Proportion   0.97962  0.98019  0.98076  0.98130  0.98183
##                          PC122   PC123    PC124    PC125    PC126    PC127
## Standard deviation     41.9668 41.8655 41.13512 40.74071 40.50967 40.12422
## Proportion of Variance  0.0005  0.0005  0.00048  0.00047  0.00047  0.00046
## Cumulative Proportion   0.9823  0.9828  0.98331  0.98379  0.98425  0.98471
##                           PC128    PC129    PC130    PC131    PC132
## Standard deviation     39.84226 39.52123 38.97680 38.38749 38.17122
## Proportion of Variance  0.00045  0.00045  0.00043  0.00042  0.00042
## Cumulative Proportion   0.98517  0.98561  0.98605  0.98647  0.98689
##                           PC133   PC134   PC135    PC136    PC137    PC138
## Standard deviation     37.70782 37.6177 37.3220 36.55369 35.92438 35.51238
## Proportion of Variance  0.00041  0.0004  0.0004  0.00038  0.00037  0.00036
## Cumulative Proportion   0.98729  0.9877  0.9881  0.98848  0.98884  0.98920
##                           PC139    PC140    PC141    PC142    PC143
## Standard deviation     35.45001 35.15678 34.35638 34.06735 33.29135
## Proportion of Variance  0.00036  0.00035  0.00034  0.00033  0.00032
## Cumulative Proportion   0.98956  0.98992  0.99025  0.99058  0.99090
##                           PC144   PC145   PC146    PC147    PC148    PC149
## Standard deviation     33.01342 32.6682 32.1846 31.88599 31.66208 31.18541
## Proportion of Variance  0.00031  0.0003  0.0003  0.00029  0.00029  0.00028
## Cumulative Proportion   0.99121  0.9915  0.9918  0.99210  0.99239  0.99267
##                           PC150    PC151    PC152    PC153    PC154
## Standard deviation     30.79636 30.57366 30.32541 29.81812 29.72790
## Proportion of Variance  0.00027  0.00027  0.00026  0.00025  0.00025
## Cumulative Proportion   0.99294  0.99321  0.99347  0.99372  0.99397
##                           PC155    PC156    PC157    PC158    PC159
## Standard deviation     29.16075 28.45111 28.12558 28.00044 27.75274
## Proportion of Variance  0.00024  0.00023  0.00023  0.00022  0.00022
## Cumulative Proportion   0.99422  0.99445  0.99467  0.99490  0.99512
##                           PC160    PC161    PC162   PC163    PC164
## Standard deviation     27.32120 27.00107 26.79261 26.2524 25.99387
## Proportion of Variance  0.00021  0.00021  0.00021  0.0002  0.00019
## Cumulative Proportion   0.99533  0.99554  0.99574  0.9959  0.99613
##                           PC165    PC166    PC167    PC168    PC169
## Standard deviation     25.25316 24.90629 24.69333 24.30137 23.80281
## Proportion of Variance  0.00018  0.00018  0.00017  0.00017  0.00016
## Cumulative Proportion   0.99632  0.99649  0.99667  0.99684  0.99700
##                           PC170    PC171    PC172    PC173    PC174
## Standard deviation     23.76362 23.27539 22.91356 22.85981 22.55678
## Proportion of Variance  0.00016  0.00015  0.00015  0.00015  0.00015
## Cumulative Proportion   0.99716  0.99731  0.99746  0.99761  0.99776
##                           PC175    PC176    PC177    PC178    PC179
## Standard deviation     22.21152 21.76808 21.24362 21.11932 20.94527
## Proportion of Variance  0.00014  0.00014  0.00013  0.00013  0.00013
## Cumulative Proportion   0.99790  0.99804  0.99816  0.99829  0.99842
##                           PC180    PC181    PC182    PC183    PC184
## Standard deviation     20.46394 20.25800 20.19252 19.42600 19.32760
## Proportion of Variance  0.00012  0.00012  0.00012  0.00011  0.00011
## Cumulative Proportion   0.99854  0.99865  0.99877  0.99888  0.99898
##                          PC185    PC186    PC187    PC188    PC189
## Standard deviation     18.6953 18.21247 17.99321 17.53140 17.07084
## Proportion of Variance  0.0001  0.00009  0.00009  0.00009  0.00008
## Cumulative Proportion   0.9991  0.99918  0.99927  0.99936  0.99944
##                           PC190    PC191    PC192    PC193    PC194
## Standard deviation     16.78692 16.39348 16.20101 15.45278 14.49178
## Proportion of Variance  0.00008  0.00008  0.00007  0.00007  0.00006
## Cumulative Proportion   0.99952  0.99960  0.99968  0.99974  0.99980
##                           PC195    PC196   PC197    PC198   PC199    PC200
## Standard deviation     13.48301 12.72247 1.2e+01 11.41324 8.49323 2.58e-13
## Proportion of Variance  0.00005  0.00005 4.0e-05  0.00004 0.00002 0.00e+00
## Cumulative Proportion   0.99986  0.99990 1.0e+00  0.99998 1.00000 1.00e+00
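The Cumulative Proportion row printed above is what you would scan to decide how many components to retain. A minimal sketch on toy data, assuming the same `prcomp`-style output (the two computation lines apply unchanged to `pca_output`):

```r
# Choose the number of principal components explaining >= 90% of the variance.
# Toy data stand in here for the MNIST pixel matrix used above.
set.seed(1)
toy <- matrix(rnorm(100 * 10), ncol = 10)
pca_toy <- prcomp(toy)

# Proportion of variance per component, then first index where the
# cumulative proportion crosses the threshold
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
n_components <- which(cumsum(var_explained) >= 0.90)[1]
n_components
```

Applied to `pca_output` above, this returns 56 (the cumulative proportion first reaches 0.90019 at PC56).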
# Store the first two coordinates and the label in a data frame
pca_plot <- data.frame(pca_x = pca_output$x[, 1], pca_y = pca_output$x[, 2], 
                       label = as.factor(mnist_sample$label)
                       )

# Plot the first two principal components using the true labels as color and shape
ggplot(pca_plot, aes(x = pca_x, y = pca_y, color = label)) + 
    ggtitle("PCA of MNIST sample") + 
    geom_text(aes(label = label)) + 
    theme(legend.position = "none")

tsne_output <- Rtsne::Rtsne(mnist_sample[, -1], PCA = FALSE, dims = 2)  # modifying the default parameters


# Explore the tsne_output structure
str(tsne_output)
## List of 14
##  $ N                  : int 200
##  $ Y                  : num [1:200, 1:2] -2.6 1.69 1.23 -7.73 5.02 ...
##  $ costs              : num [1:200] 0.00562 0.00481 0.000989 0.004309 0.003997 ...
##  $ itercosts          : num [1:20] 52 52.6 52.4 53.7 53.4 ...
##  $ origD              : int 50
##  $ perplexity         : num 30
##  $ theta              : num 0.5
##  $ max_iter           : num 1000
##  $ stop_lying_iter    : int 250
##  $ mom_switch_iter    : int 250
##  $ momentum           : num 0.5
##  $ final_momentum     : num 0.8
##  $ eta                : num 200
##  $ exaggeration_factor: num 12
# Have a look at the first records from the t-SNE output
head(tsne_output$Y)
##       [,1]   [,2]
## [1,] -2.60  4.919
## [2,]  1.69  8.931
## [3,]  1.23 -0.786
## [4,] -7.73 -9.117
## [5,]  5.02 -1.973
## [6,] -5.77  3.864
# Store the first two coordinates and the label in a data.frame
tsne_plot <- data.frame(tsne_x = tsne_output$Y[, 1], tsne_y = tsne_output$Y[, 2], 
                        label = as.factor(mnist_sample$label)
                        )

# Plot the t-SNE embedding using the true labels as color and shape
ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = label)) + 
    ggtitle("t-SNE output") + 
    geom_text(aes(label = label)) + 
    theme(legend.position = "none")


Chapter 2 - Introduction to t-SNE

Building a t-SNE Embedding:

  • The t-SNE method was published in 2008
    • Non-linear dimensionality reduction technique
    • Works well for most problems and is a very good method for visualizing high-dimensional datasets
    • Rather than keeping dissimilar points far apart (as PCA does), it keeps the low-dimensional representations of similar points close together
  • Generally, the t-SNE method starts with PCA, then moves on to additional steps
    • Use PCA to reduce the input dimensions into a small number
    • Construct a probability distribution over pairs of original high dimensional records
    • Define a similarity probability distribution of the points in the low-dimensional embedding
    • Minimize the K-L divergence between the two distributions using gradient descent
  • Can run t-SNE in R using the Rtsne package
    • library(Rtsne)
    • tsne_output <- Rtsne(mnist[, -1])
    • tsne_output <- Rtsne(mnist[, -1], PCA = FALSE, dims = 3) # modifying the default parameters
    • tsne_output$itercosts # total K-L divergence cost, sampled every 50 iterations
    • head(tsne_output$costs) # cost of each record after the final iteration
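The quantity t-SNE minimizes is the Kullback-Leibler divergence between the two similarity distributions. A minimal sketch with made-up toy distributions p (high-dimensional similarities) and q (embedding similarities):

```r
# K-L divergence KL(P || Q) = sum(p * log(p / q)); t-SNE drives this toward 0
# by moving the embedded points so that q approaches p.
# p and q below are hypothetical toy distributions, not course data.
p <- c(0.7, 0.2, 0.1)  # pairwise similarities among the original points
q <- c(0.5, 0.3, 0.2)  # pairwise similarities in the low-dimensional embedding
kl <- sum(p * log(p / q))
round(kl, 4)  # 0.0851; the divergence is 0 only when q matches p exactly
```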

Optimal Number of t-SNE Iterations:

  • Hyper-parameters are common in machine learning, and t-SNE has several of them
    • Number of iterations
    • Perplexity
    • Learning rate
    • Optimization criterion: K-L divergence
  • Because t-SNE is not deterministic, running it with the same parameters can yield different results unless the random seed is fixed
    • set.seed(1234)
    • tsne_output_1 <- Rtsne(mnist[, -1], max_iter = 1500)
    • set.seed(1234)
    • tsne_output_2 <- Rtsne(mnist[, -1], max_iter = 1500)
    • identical(tsne_output_1, tsne_output_2) # TRUE
  • One of the important parameters for t-SNE is the number of iterations (default is 1000)

Effect of Perplexity Parameter:

  • Perplexity is a hyper-parameter for balancing global and local criteria
    • A guess about the number of close neighbors
    • In a real setting it is important to try different values
    • Must be well below the number of input records (Rtsne requires 3 * perplexity < n - 1)
    • tsne_output <- Rtsne(mnist[, -1], perplexity = 50, max_iter = 1300)

Classifying Digits with t-SNE:

  • Can classify digits (and run other tasks) using t-SNE
    • Build a t-SNE model and calculate the centroids
    • Classify unknown data points based on proximity to the centroid
  • Example for step 1 (building the t-SNE)
    • tsne_out <- Rtsne(mnist_10k[, -1], perplexity = 5)
    • tsne_plot <- data.frame(tsne_x = tsne_out$Y[1:5000, 1], tsne_y = tsne_out$Y[1:5000, 2], digit = as.factor(mnist_10k[1:5000, ]$label))
    • ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = digit)) + ggtitle("MNIST embedding of the first 5K digits") + geom_text(aes(label = digit)) + theme(legend.position = "none")
  • Example for step 1b (calculating the centroids)
    • centroids <- as.data.table(tsne_out$Y[1:5000, ])
    • setnames(centroids, c("X", "Y"))
    • centroids[, label := as.factor(mnist_10k[1:5000, ]$label)]
    • centroids[, mean_X := mean(X), by = label]
    • centroids[, mean_Y := mean(Y), by = label]
    • centroids <- unique(centroids, by = "label")
    • ggplot(centroids, aes(x = mean_X, y = mean_Y, color = label)) + ggtitle("Centroids coordinates") + geom_text(aes(label = label)) + theme(legend.position = "none")
  • Example for step 2 (classifying new digits)
    • distances <- as.data.table(tsne_out$Y[5001:10000, ])
    • setnames(distances, c("X", "Y"))
    • distances[, label := mnist_10k[5001:10000, ]$label]
    • distances <- distances[label == 4 | label == 9]
    • distances[, dist_4 := sqrt(((X - centroids[label == 4, ]$mean_X) + (Y - centroids[label == 4, ]$mean_Y))^2)]
    • ggplot(distances, aes(x = dist_4, fill = as.factor(label))) + geom_histogram(binwidth = 5, alpha = .5, position = "identity", show.legend = FALSE)

Example code includes:

# Compute t-SNE without doing the PCA step
tsne_output <- Rtsne::Rtsne(mnist_sample[, -1], PCA = FALSE, dims = 3)

# Show the obtained embedding coordinates
head(tsne_output$Y)
##       [,1]   [,2]    [,3]
## [1,] -3.90  0.626  2.2402
## [2,]  9.86 -8.222  9.6127
## [3,] -8.74 -5.232  4.4015
## [4,] -9.58  9.981  3.3003
## [5,] -7.68 -3.214 -0.0745
## [6,] -7.74  2.807  7.7508
# Store the first two coordinates and plot them 
tsne_plot <- data.frame(tsne_x = tsne_output$Y[, 1], tsne_y = tsne_output$Y[, 2], 
                        digit = as.factor(mnist_sample$label)
                        )

# Plot the coordinates
ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = digit)) + 
ggtitle("t-SNE of MNIST sample") + 
geom_text(aes(label = digit)) + 
theme(legend.position = "none")

# Inspect the output object's structure
str(tsne_output)
## List of 14
##  $ N                  : int 200
##  $ Y                  : num [1:200, 1:3] -3.9 9.86 -8.74 -9.58 -7.68 ...
##  $ costs              : num [1:200] 0.00256 0.00461 0.00169 0.00222 0.0029 ...
##  $ itercosts          : num [1:20] 52.4 53.4 52.6 52.9 51.7 ...
##  $ origD              : int 50
##  $ perplexity         : num 30
##  $ theta              : num 0.5
##  $ max_iter           : num 1000
##  $ stop_lying_iter    : int 250
##  $ mom_switch_iter    : int 250
##  $ momentum           : num 0.5
##  $ final_momentum     : num 0.8
##  $ eta                : num 200
##  $ exaggeration_factor: num 12
# Show the total K-L divergence at every 50th iteration
tsne_output$itercosts
##  [1] 52.418 53.442 52.558 52.901 51.656  0.802  0.579  0.533  0.521  0.520
## [11]  0.517  0.515  0.513  0.511  0.510  0.510  0.508  0.507  0.504  0.503
# Show the K-L divergence cost of each record after the final iteration
tsne_output$costs
##   [1] 0.002556 0.004608 0.001691 0.002216 0.002898 0.004183 0.002754
##   [8] 0.001666 0.001401 0.002071 0.001997 0.002965 0.002408 0.002127
##  [15] 0.004021 0.001847 0.004202 0.001360 0.000661 0.002919 0.001963
##  [22] 0.002013 0.005414 0.002476 0.002521 0.002330 0.001597 0.003748
##  [29] 0.002867 0.001238 0.006646 0.002232 0.004471 0.002023 0.003331
##  [36] 0.003620 0.003690 0.003046 0.001068 0.003092 0.004347 0.002435
##  [43] 0.002224 0.001266 0.002469 0.003183 0.003292 0.005514 0.001522
##  [50] 0.003355 0.002538 0.003548 0.000973 0.003724 0.003210 0.005844
##  [57] 0.001530 0.003629 0.000501 0.001255 0.002963 0.001798 0.000946
##  [64] 0.001219 0.001717 0.000864 0.001107 0.001536 0.001697 0.001502
##  [71] 0.003486 0.004288 0.000579 0.004023 0.001357 0.003757 0.000792
##  [78] 0.002060 0.001243 0.001878 0.002739 0.002442 0.003554 0.001439
##  [85] 0.004956 0.001095 0.002383 0.002107 0.001548 0.002985 0.003448
##  [92] 0.000709 0.003105 0.001277 0.006636 0.004045 0.001300 0.001124
##  [99] 0.002436 0.001912 0.001468 0.003581 0.001166 0.001716 0.001500
## [106] 0.001844 0.001151 0.003541 0.002041 0.001340 0.001106 0.001858
## [113] 0.001699 0.001883 0.003128 0.003232 0.003683 0.003367 0.000254
## [120] 0.004959 0.002403 0.005469 0.000558 0.001905 0.004205 0.002494
## [127] 0.003512 0.003912 0.004192 0.003329 0.010506 0.002245 0.002350
## [134] 0.001941 0.001677 0.003609 0.004245 0.000986 0.002019 0.003203
## [141] 0.001641 0.002979 0.002684 0.001287 0.001096 0.003645 0.001900
## [148] 0.001790 0.002159 0.002323 0.000495 0.001795 0.001945 0.001819
## [155] 0.005875 0.004069 0.004675 0.003174 0.000603 0.000859 0.000715
## [162] 0.000791 0.003574 0.001305 0.001697 0.001860 0.001829 0.001890
## [169] 0.003916 0.002265 0.000837 0.001701 0.003167 0.002093 0.003039
## [176] 0.004440 0.001640 0.002899 0.004424 0.002667 0.001839 0.001788
## [183] 0.002397 0.001005 0.002155 0.000892 0.001803 0.002666 0.004028
## [190] 0.001752 0.003715 0.001889 0.002250 0.003404 0.001618 0.000842
## [197] 0.001877 0.003156 0.003224 0.001500
# Plot the total K-L divergence at every 50th iteration
plot(tsne_output$itercosts, type = "l")

# Plot the K-L divergence cost of each record after the final iteration
plot(tsne_output$costs, type = "l")

# Generate a three-dimensional t-SNE embedding without PCA
tsne_output <- Rtsne::Rtsne(mnist_sample[, -1], PCA=FALSE, dims=3)

# Generate a new t-SNE embedding with the same hyper-parameter values
tsne_output_new <- Rtsne::Rtsne(mnist_sample[, -1], PCA=FALSE, dims=3)

# Check if the two outputs are identical
identical(tsne_output, tsne_output_new)
## [1] FALSE
# Generate a three-dimensional t-SNE embedding without PCA
set.seed(1234)
tsne_output <- Rtsne::Rtsne(mnist_sample[, -1], PCA = FALSE, dims = 3)

# Generate a new t-SNE embedding with the same hyper-parameter values
set.seed(1234)
tsne_output_new <- Rtsne::Rtsne(mnist_sample[, -1], PCA = FALSE, dims = 3)

# Check if the two outputs are identical
identical(tsne_output, tsne_output_new)
## [1] TRUE
# Set seed to ensure reproducible results
set.seed(1234)

# Execute a t-SNE with 2000 iterations
tsne_output <- Rtsne::Rtsne(mnist_sample[, -1], PCA=TRUE, dims=2, max_iter=2000)

# Observe the output costs 
tsne_output$itercosts
##  [1] 53.214 53.951 54.243 52.808 51.851  1.166  0.920  0.836  0.799  0.788
## [11]  0.779  0.771  0.768  0.766  0.763  0.739  0.731  0.722  0.722  0.715
## [21]  0.713  0.701  0.699  0.697  0.695  0.689  0.684  0.684  0.681  0.677
## [31]  0.676  0.676  0.675  0.675  0.674  0.674  0.674  0.674  0.674  0.676
# Get the index (one entry per 50 iterations) with the minimum K-L cost
which.min(tsne_output$itercosts)
## [1] 39
# Set seed to ensure reproducible results
set.seed(1234)

# Execute a t-SNE with perplexity 5
tsne_output_5 <- Rtsne::Rtsne(mnist_sample[, -1], perplexity=5, max_iter=1200)

# Observe the returned K-L divergence costs at every 50th iteration
tsne_output_5$itercosts
##  [1] 75.889 76.215 74.568 75.288 74.557  2.479  1.575  1.309  1.137  0.999
## [11]  0.932  0.892  0.881  0.848  0.823  0.816  0.815  0.812  0.810  0.806
## [21]  0.804  0.798  0.783  0.765
# Set seed to ensure reproducible results
set.seed(1234)

# Execute a t-SNE with perplexity 20
tsne_output_20 <- Rtsne::Rtsne(mnist_sample[, -1], perplexity=20, max_iter=1200)

# Observe the returned K-L divergence costs at every 50th iteration
tsne_output_20$itercosts
##  [1] 56.897 57.863 57.047 57.774 58.292  1.375  1.055  0.917  0.895  0.888
## [11]  0.858  0.819  0.785  0.765  0.762  0.749  0.741  0.737  0.735  0.733
## [21]  0.733  0.733  0.725  0.726
# Set seed to ensure reproducible results
set.seed(1234)

# Execute a t-SNE with perplexity 50
tsne_output_50 <- Rtsne::Rtsne(mnist_sample[, -1], perplexity=50, max_iter=1200)

# Observe the returned K-L divergence costs at every 50th iteration
tsne_output_50$itercosts
##  [1] 45.839 45.866 45.343 46.623 46.922  0.929  0.698  0.597  0.583  0.571
## [11]  0.569  0.566  0.565  0.567  0.566  0.567  0.565  0.567  0.567  0.567
## [21]  0.565  0.567  0.568  0.569
# Observe the K-L divergence costs with perplexity 5 and 50
tsne_output_5$itercosts
##  [1] 75.889 76.215 74.568 75.288 74.557  2.479  1.575  1.309  1.137  0.999
## [11]  0.932  0.892  0.881  0.848  0.823  0.816  0.815  0.812  0.810  0.806
## [21]  0.804  0.798  0.783  0.765
tsne_output_50$itercosts
##  [1] 45.839 45.866 45.343 46.623 46.922  0.929  0.698  0.597  0.583  0.571
## [11]  0.569  0.566  0.565  0.567  0.566  0.567  0.565  0.567  0.567  0.567
## [21]  0.565  0.567  0.568  0.569
# Generate the data frame to visualize the embedding
tsne_plot_5 <- data.frame(tsne_x = tsne_output_5$Y[, 1], tsne_y = tsne_output_5$Y[, 2], digit = as.factor(mnist_sample$label))
tsne_plot_50 <- data.frame(tsne_x = tsne_output_50$Y[, 1], tsne_y = tsne_output_50$Y[, 2], digit = as.factor(mnist_sample$label))

# Plot the obtained embeddings
ggplot(tsne_plot_5, aes(x = tsne_x, y = tsne_y, color = digit)) + 
ggtitle("MNIST t-SNE with 1200 iter and Perplexity=5") + geom_text(aes(label = digit)) + 
theme(legend.position="none")

ggplot(tsne_plot_50, aes(x = tsne_x, y = tsne_y, color = digit)) + 
ggtitle("MNIST t-SNE with 1200 iter and Perplexity=50") + geom_text(aes(label = digit)) + 
theme(legend.position="none")

# Prepare the data.frame
tsne_plot <- data.frame(tsne_x = tsne_output_50$Y[1:100, 1], 
                        tsne_y = tsne_output_50$Y[1:100, 2], 
                        digit = as.factor(mnist_sample$label[1:100])
                        )

# Plot the obtained embedding
ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = digit)) + 
ggtitle("MNIST embedding of the first 100 digits") + 
geom_text(aes(label = digit)) + 
theme(legend.position="none")

# Get the first 100 records and set the column names
dt_prototypes <- as.data.table(tsne_output_50$Y[1:100, ])
setnames(dt_prototypes, c("X","Y"))

# Paste the label column as factor
dt_prototypes[, label := as.factor(mnist_sample[1:100,]$label)]

# Compute the centroids per label
dt_prototypes[, mean_X := mean(X), by = label]
dt_prototypes[, mean_Y := mean(Y), by = label]

# Get the unique records per label
dt_prototypes <- unique(dt_prototypes, by = "label")
dt_prototypes
##          X      Y label mean_X mean_Y
##  1: -0.307  0.168     5  0.702 -0.137
##  2: -2.547  5.049     0  0.624  0.715
##  3: -3.043  0.658     7 -3.946  0.142
##  4: -2.593 -0.343     9 -2.609 -2.543
##  5: -1.632 -0.530     3  2.309  1.000
##  6: -4.941  1.640     4 -2.938 -1.811
##  7: -0.975  1.433     1 -0.403  1.258
##  8:  2.874  5.434     2  1.153  4.649
##  9:  2.486 -4.996     6  2.321 -3.062
## 10: -0.808 -3.918     8 -0.545 -1.615
# Store the last 100 records in distances and set column names
distances <- as.data.table(tsne_output_50$Y[101:200, ])
setnames(distances, c("X", "Y"))

# Paste the true label
distances[, label := mnist_sample[101:200,]$label]

# Filter only those labels that are 1 or 0 
distances <- distances[label == 1 | label == 0]

# Compute distance to the prototype of digit 1
# (note: this computes |dX + dY|, not the true Euclidean sqrt(dX^2 + dY^2))
distances[, dist_1 := sqrt(( (X - dt_prototypes[label == 1,]$mean_X) + 
                             (Y - dt_prototypes[label == 1,]$mean_Y))^2)]


# Compute the basic statistics of distances from records of class 1
summary(distances[label == 1]$dist_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.045   0.652   0.886   1.121   1.819   2.253
# Compute the basic statistics of distances from records of class 0
summary(distances[label == 0]$dist_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.16    0.38    1.34    2.70    5.22    6.05
# Plot the histogram of distances of each class
ggplot(distances, aes(x = dist_1, fill = as.factor(label))) +
geom_histogram(binwidth = 5, alpha = .5, position = "identity", show.legend = FALSE) + 
ggtitle("Distribution of Euclidean distance 1 vs 0")


Chapter 3 - Using t-SNE with Predictive Models

Credit Card Fraud Detection:

  • There are benefits of using t-SNE for feature engineering for further analysis
    • Less correlation of input features
    • Reduction in computation time
  • Database of European credit card fraud available from 2013
    • Released by Andrea Dal Pozzolo, et al. and available in Kaggle datasets
    • Highly unbalanced: 492 fraud cases out of 284,807 transactions (0.172%)
    • Anonymized numerical features which are the result of a PCA
    • 30 features plus the Class (1 fraud, 0 not-fraud)
    • We only know the meaning of two features: time and amount of the transaction
  • Need to manage the class imbalance, for example by over-sampling the minority class or under-sampling the majority class
    • set.seed(1234)
    • idx <- sample(1:nrow(creditcard), nrow(creditcard)*.20)
    • creditcard.test <- creditcard[idx]
    • creditcard.train <- creditcard[!idx]
    • creditcard.pos <- creditcard.train[Class==1]
    • creditcard.neg <- creditcard.train[Class==0]
    • creditcard.neg.bal <- creditcard.neg[sample(1:nrow(creditcard.neg), nrow(creditcard.pos))]
    • creditcard.train <- rbind(creditcard.pos, creditcard.neg.bal)

Training Random Forest Models:

  • The random forest model can help with classification (a widely used method that does not require much fine-tuning of parameters)
    • randomForest is the most common R package for random forests
    • library(randomForest)
    • train_x <- creditcard_train[, -31]
    • train_y <- creditcard_train$Class
    • rf_model <- randomForest(x = train_x, y = train_y, ntree = 100)
  • Can plot performance based on the number of trees
    • plot(rf_model, main = "Error evolution vs number of trees")
    • legend("topright", colnames(rf_model$err.rate), col = 1:3, cex = 0.8, fill = 1:3)
    • varImpPlot(rf_model, main = "Variable importance")

Predicting Data:

  • Can make predictions using the random forest and evaluate the model
    • prop.table(table(creditcard_test$Class))
    • pred_rf <- predict(rf_model, creditcard_test, type = "prob")
    • hist(pred_rf[, 2], main = "Histogram of predictions on the test set", xlab = "prediction value")
  • Can assess quality of predictions using AUC
    • pred <- prediction(pred_rf[,2], creditcard_test$Class)
    • perf <- performance(pred, measure = "auc")
    • perf@y.values
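The bullets above rely on the ROCR package. The same number can also be computed by hand, because AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A hedged sketch using the rank-based (Mann-Whitney) formulation, with made-up scores and labels:

```r
# Rank-based AUC on toy predictions (scores and labels below are
# hypothetical, not from the credit card data)
scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)  # predicted fraud probabilities
labels <- c(1, 1, 0, 1, 0)            # true classes
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
r <- rank(scores)
auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc  # 5/6: five of the six positive/negative pairs are ranked correctly
```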

Visualizing Neural Network Layers:

  • Neural networks can also be used for classification, with layers of neurons producing a final output
  • There are many types of activation functions - sigmoid, tanh, relu, leaky relu, etc.
  • Visualizing the weights for each neuron in each layer can be valuable, and dimensionality reduction helps with that
    • head(layer_128_train[, 1:7])
    • summary(layer_128_train[, 1:4])
    • tsne_nn_layer_train <- Rtsne(as.matrix(layer_128_train), perplexity = 50, max_iter = 400, check_duplicates = F, dims = 2, verbose = T)
    • tsne_plot_train <- data.frame(tsne_x = tsne_nn_layer_train$Y[, 1], tsne_y = tsne_nn_layer_train$Y[, 2], y_col = creditcard_train$Class)
    • ggplot(tsne_plot_train, aes(x = tsne_x, y = tsne_y, color = y_col)) + geom_point() + ggtitle("Credit card embedding 128 neurons layer") + theme(legend.position = "none")
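The activation functions named above are one-liners in base R. A small sketch (function names here are illustrative; only `tanh` is built in):

```r
# Common neural-network activation functions mentioned above
sigmoid    <- function(x) 1 / (1 + exp(-x))
relu       <- function(x) pmax(0, x)
leaky_relu <- function(x, alpha = 0.01) ifelse(x > 0, x, alpha * x)

sigmoid(0)      # 0.5
relu(c(-2, 3))  # 0 3
leaky_relu(-2)  # -0.02
tanh(0)         # 0 (built into base R)
```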

Example code includes:

# Look at the data dimensions
dim(creditcard)
## [1] 28923    31
# Explore the column names
names(creditcard)
##  [1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"    
##  [8] "V7"     "V8"     "V9"     "V10"    "V11"    "V12"    "V13"   
## [15] "V14"    "V15"    "V16"    "V17"    "V18"    "V19"    "V20"   
## [22] "V21"    "V22"    "V23"    "V24"    "V25"    "V26"    "V27"   
## [29] "V28"    "Amount" "Class"
# Observe some records
str(creditcard)
## Classes 'data.table' and 'data.frame':   28923 obs. of  31 variables:
##  $ Time  : num  406 472 4462 6986 7519 ...
##  $ V1    : num  -2.31 -3.04 -2.3 -4.4 1.23 ...
##  $ V2    : num  1.95 -3.16 1.76 1.36 3.02 ...
##  $ V3    : num  -1.61 1.09 -0.36 -2.59 -4.3 ...
##  $ V4    : num  4 2.29 2.33 2.68 4.73 ...
##  $ V5    : num  -0.522 1.36 -0.822 -1.128 3.624 ...
##  $ V6    : num  -1.4265 -1.0648 -0.0758 -1.7065 -1.3577 ...
##  $ V7    : num  -2.537 0.326 0.562 -3.496 1.713 ...
##  $ V8    : num  1.3917 -0.0678 -0.3991 -0.2488 -0.4964 ...
##  $ V9    : num  -2.77 -0.271 -0.238 -0.248 -1.283 ...
##  $ V10   : num  -2.772 -0.839 -1.525 -4.802 -2.447 ...
##  $ V11   : num  3.202 -0.415 2.033 4.896 2.101 ...
##  $ V12   : num  -2.9 -0.503 -6.56 -10.913 -4.61 ...
##  $ V13   : num  -0.5952 0.6765 0.0229 0.1844 1.4644 ...
##  $ V14   : num  -4.29 -1.69 -1.47 -6.77 -6.08 ...
##  $ V15   : num  0.38972 2.00063 -0.69883 -0.00733 -0.33924 ...
##  $ V16   : num  -1.141 0.667 -2.282 -7.358 2.582 ...
##  $ V17   : num  -2.83 0.6 -4.78 -12.6 6.74 ...
##  $ V18   : num  -0.0168 1.7253 -2.6157 -5.1315 3.0425 ...
##  $ V19   : num  0.417 0.283 -1.334 0.308 -2.722 ...
##  $ V20   : num  0.12691 2.10234 -0.43002 -0.17161 0.00906 ...
##  $ V21   : num  0.517 0.662 -0.294 0.574 -0.379 ...
##  $ V22   : num  -0.035 0.435 -0.932 0.177 -0.704 ...
##  $ V23   : num  -0.465 1.376 0.173 -0.436 -0.657 ...
##  $ V24   : num  0.3202 -0.2938 -0.0873 -0.0535 -1.6327 ...
##  $ V25   : num  0.0445 0.2798 -0.1561 0.2524 1.4889 ...
##  $ V26   : num  0.178 -0.145 -0.543 -0.657 0.567 ...
##  $ V27   : num  0.2611 -0.2528 0.0396 -0.8271 -0.01 ...
##  $ V28   : num  -0.1433 0.0358 -0.153 0.8496 0.1468 ...
##  $ Amount: num  0 529 240 59 1 ...
##  $ Class : chr  "1" "1" "1" "1" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Generate a summary
# summary(creditcard)

# Plot a histogram of the transaction time
ggplot(creditcard, aes(x = Time)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Extract positive and negative instances of fraud
creditcard_pos <- creditcard[Class == 1]
creditcard_neg <- creditcard[Class == 0]

# Fix the seed
set.seed(1234)

# Create a new negative balanced dataset by undersampling
creditcard_neg_bal <- creditcard_neg[sample(1:nrow(creditcard_neg), nrow(creditcard_pos)), ]

# Generate a balanced train set
creditcard_train <- rbind(creditcard_pos, creditcard_neg_bal)


# Fix the seed
set.seed(1234)

# Separate x and y sets
train_x <- creditcard_train[, -31]
train_y <- as.factor(creditcard_train$Class)

# Train a random forest
rf_model <- randomForest::randomForest(train_x, train_y, ntree=100)

# Plot the error evolution and variable importance
plot(rf_model)

randomForest::varImpPlot(rf_model)

# Set the seed
set.seed(1234)

# Generate the t-SNE embedding 
tsne_output <- Rtsne::Rtsne(as.matrix(creditcard_train[, -31]), check_duplicates = FALSE, PCA=FALSE)

# Generate a data frame to plot the result
tsne_plot <- data.frame(tsne_x = tsne_output$Y[, 1],
                        tsne_y = tsne_output$Y[, 2],
                        Class = creditcard_train$Class
                        )

# Plot the embedding using ggplot and the label
ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = Class)) + 
    ggtitle("t-SNE of credit card fraud train set") + 
    geom_text(aes(label = Class)) + theme(legend.position = "none")

# Fix the seed
set.seed(1234)

# Train a random forest
rf_model_tsne <- randomForest::randomForest(tsne_plot[, c("tsne_x", "tsne_y")], 
                                            as.factor(creditcard_train$Class), ntree=100
                                            )

# Plot the error evolution
plot(rf_model_tsne)

# Plot the variable importance
randomForest::varImpPlot(rf_model_tsne)

# Predict on the test set using the random forest 
# pred_rf <- predict(rf_model, creditcard_test, type = "prob")

# Plot a probability distribution of the target class
# hist(pred_rf[, 2])

# Compute the area under the curve
# pred <- prediction(pred_rf[, 2], creditcard_test$Class)
# perf <- performance(pred, measure = "auc") 
# perf@y.values


# Predict on the test set using the random forest generated with t-SNE features
# pred_rf <- predict(rf_model_tsne, test_x, type = "prob")

# Plot a probability distribution of the target class
# hist(pred_rf[, 2])

# Compute the area under the curve
# pred <- prediction(pred_rf[, 2], creditcard_test$Class)
# perf <- performance(pred, measure="auc") 
# perf@y.values


# Observe the dimensions
# dim(layer_128_train)

# Show the first six records of the last ten columns
# head(layer_128_train[, 118:128])

# Generate a summary of all columns
# summary(layer_128_train)


# Set the seed
# set.seed(1234)

# Generate the t-SNE
# tsne_output <- Rtsne(as.matrix(layer_128_train), check_duplicates=FALSE, max_iter=400, perplexity=50)

# Prepare data.frame
# tsne_plot <- data.frame(tsne_x = tsne_output$Y[, 1], tsne_y = tsne_output$Y[, 2], 
#                         Class = creditcard_train$Class
#                         )

# Plot the data 
# ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = Class)) + 
#     geom_point() + 
#     ggtitle("Credit card embedding of Last Neural Network Layer")

Chapter 4 - Generalized Low Rank Models

Exploring Fashion MNIST dataset:

  • The fashion MNIST dataset contains 70,000 grayscale images at 28x28 pixels of 10 clothing categories
    • Identical format to traditional MNIST
    • Released by Zalando
    • With the goal of replacing MNIST, because:
    • MNIST is easy to predict
    • MNIST is overused
  • Exploring and modeling with the fashion MNIST data
    • class_names <- c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
    • xy_axis <- data.frame(x = expand.grid(1:28, 28:1)[,1], y = expand.grid(1:28, 28:1)[,2])
    • plot_data <- cbind(xy_axis, fill = as.data.frame(t(fashion_mnist[1, -1]))[,1])
    • plot_theme <- list( raster = geom_raster(hjust = 0, vjust = 0), gradient_fill = scale_fill_gradient(low = "white", high = "black", guide = FALSE), theme = theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(), panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.background = element_blank()) )
    • ggplot(plot_data, aes(x, y, fill = fill)) + ggtitle(class_names[as.integer(fashion_mnist[1, 1]) + 1]) + plot_theme

Generalized Low Rank Models (GLRM):

  • There are many benefits to the GLRM approach, including
    • Reduces the required storage
    • Enables data visualization
    • Removes noise
    • Imputes missing data
    • Simplifies data processing
  • The low-rank structure approximates an m x n matrix by the product of an m x k matrix and a k x n matrix
    • Parallelized dimensionality reduction algorithm
    • Categorical columns are transformed into binary columns
  • Can run GLRM in R with H2O
    • H2O is an open source machine learning framework with R interfaces
    • Has a good parallel implementation of GLRM
    • Steps: (1) initialize the cluster and (2) store the input data
    • h2o.init()
    • fashion_mnist.hex <- as.h2o(fashion_mnist, "fashion_mnist.hex")
    • model_glrm <- h2o.glrm(training_frame = fashion_mnist.hex, cols = 2:ncol(fashion_mnist), k = 2, max_iterations = 100) # k is the rank (dimension) of the space

Visualizing a GLRM Model:

  • Often helpful to extract the XY representation of the GLRM
    • X <- as.data.table(h2o.getFrame(model_glrm@model$representation_name)) # dim will be nxk
    • Y <- model_glrm@model$archetypes # dim will be kxm
    • ggplot(X, aes(x = Arch1, y = Arch2, color = fashion_mnist$label)) + ggtitle("Fashion Mnist GLRM Archetypes") + geom_text(aes(label = fashion_mnist$label)) + theme(legend.position = "none")
  • Can grab the centroids by class, and use as prototypes for classification
    • X[, label := as.numeric(fashion_mnist$label)]
    • X[, mean_x := mean(Arch1), by = label]
    • X[, mean_y := mean(Arch2), by = label]
    • X_mean <- unique(X, by = "label")
    • class_names = c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
    • ggplot(X_mean, aes(x = mean_x, y = mean_y, color = as.factor(X_mean$label))) + ggtitle("Fashion Mnist GLRM class centroids") + geom_text(aes(label = class_names[label])) + theme(legend.position = "none")
  • Can reconstruct the original data using X*Y, with the predict method
    • fashion_pred <- predict(model_glrm, fashion_mnist.hex)
    • head(fashion_pred[1:2, 1:4])
    • xy_axis <- data.frame(x = expand.grid(1:28,28:1)[,1], y = expand.grid(1:28,28:1)[,2])
    • data_reconstructed <- cbind(xy_axis, fill = as.data.frame(t(fashion_pred[1000,]))[,1])
    • plot_reconstructed <- ggplot(data_reconstructed, aes(x, y, fill = fill)) + ggtitle("Reconstructed Pullover (K=2)") + plot_theme
    • data_original <- cbind(xy_axis, fill = as.data.frame(t(fashion_mnist[1000, -1]))[,1])
    • plot_original <- ggplot(data_original, aes(x, y, fill = fill)) + ggtitle("Original Pullover") + plot_theme
    • grid.arrange(plot_reconstructed, plot_original, nrow = 1)
  • Higher values of k will typically lead to better reconstructions - assessed visually or by using the resulting error values

Dealing with Missing Data and Speeding-Up Models:

  • Missing data can confound the analysis, and a good approach to it is required
  • Common in real-world datasets
    • May be intentionally not provided
    • May be due to an error
    • With GLRM we can impute missing data and assign an estimation
  • Example of randomly generated missing data using the fashion_mnist dataset
    • fashion_mnist_miss.hex <- h2o.insertMissingValues(fashion_mnist.hex[,-1], fraction=0.2, seed = 1234)
    • summary(fashion_mnist_miss[,781:784])
    • model_glrm <- h2o.glrm(training_frame = fashion_mnist_miss.hex, transform = "NORMALIZE", ignore_const_cols = FALSE, k = 64, max_iterations = 200, seed = 123)
    • fashion_pred <- h2o.predict(model_glrm, fashion_mnist_miss.hex)
    • summary(fashion_pred[,782:784])
  • Another advantage of GLRM is speeding up model training (lower volumes of data to process)
    • Training machine learning models is faster using a low-dimensional representation
    • Key is to have a good compressed representation
  • Example of training a random forest
    • Trained several h2o random forests with 4-fold cross-validation
    • Fashion MNIST (60,000 images) was compressed with GLRM, varying k from 2 to 256
    • Measured the accuracy and the required training time
    • perf_metrics

Summary and Wrap-Up:

  • Can use t-SNE and GLRM to reduce dimensionality
  • Simplify data processing, enhance data visualization, etc.

Example code includes:

# Show the dimensions
dim(fashion_mnist)
## [1] 500 785
# Create a summary of the last five columns 
summary(fashion_mnist[, 780:785])
##     pixel779        pixel780        pixel781        pixel782    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:  0.0  
##  Median :  0.0   Median :  0.0   Median :  0.0   Median :  0.0  
##  Mean   : 23.5   Mean   : 18.5   Mean   :  7.1   Mean   :  2.3  
##  3rd Qu.:  0.0   3rd Qu.:  1.0   3rd Qu.:  0.0   3rd Qu.:  0.0  
##  Max.   :224.0   Max.   :233.0   Max.   :204.0   Max.   :171.0  
##     pixel783       pixel784
##  Min.   : 0.0   Min.   :0  
##  1st Qu.: 0.0   1st Qu.:0  
##  Median : 0.0   Median :0  
##  Mean   : 0.5   Mean   :0  
##  3rd Qu.: 0.0   3rd Qu.:0  
##  Max.   :77.0   Max.   :0
# Table with the class distribution
table(fashion_mnist$label)
## 
##  0  1  2  3  4  5  6  7  8  9 
## 42 44 50 49 53 52 59 54 46 51
xy_axis <- data.frame(x=rep(1:28, times=28), y=rep(28:1, each=28))
plot_theme <- list( raster = geom_raster(hjust = 0, vjust = 0), 
                    gradient_fill = scale_fill_gradient(low = "white", high = "black", guide = FALSE), 
                    theme = theme(axis.line = element_blank(), axis.text = element_blank(), 
                                  axis.ticks = element_blank(), axis.title = element_blank(),
                                  panel.background = element_blank(), panel.border = element_blank(),
                                  panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
                                  plot.background = element_blank()
                                  )
                    )  
class_names <- c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
                 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
                 )  

# Get the data from the last image
plot_data <- cbind(xy_axis, fill = as.data.frame(t(fashion_mnist[500, -1]))[,1])

# Observe the first records
head(plot_data)
##   x  y fill
## 1 1 28    0
## 2 2 28    0
## 3 3 28    0
## 4 4 28    0
## 5 5 28    0
## 6 6 28    0
# Plot the image using ggplot()
ggplot(plot_data, aes(x, y, fill = fill)) + 
    ggtitle(class_names[as.integer(fashion_mnist[500, 1]) + 1]) + 
    plot_theme 

# Start a connection with the h2o cluster
h2o::h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         15 days 23 hours 
##     H2O cluster timezone:       America/Chicago 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    8 months and 1 day !!! 
##     H2O cluster name:           H2O_started_from_R_Dave_bvu150 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   4.12 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.3 (2019-03-11)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (8 months and 1 day)!
## Please download and install the latest version from http://h2o.ai/download/
# Store the data into h2o cluster
fashion_mnist.hex <- h2o::as.h2o(fashion_mnist, "fashion_mnist.hex")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
# Launch a GLRM model over fashion_mnist data
model_glrm <- h2o::h2o.glrm(training_frame = fashion_mnist.hex, cols = 2:ncol(fashion_mnist), 
                            k = 2, seed = 123, max_iterations = 100
                            )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [pixel58, pixel29, pixel1, pixel28, pixel784].
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |=================================================================| 100%
# Plotting the convergence
plot(model_glrm)

# Launch a GLRM model with normalized fashion_mnist data  
model_glrm <- h2o::h2o.glrm(training_frame = fashion_mnist.hex, transform = "NORMALIZE",
                            cols = 2:ncol(fashion_mnist), k = 2, seed = 123, max_iterations = 100
                            )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [pixel58, pixel29, pixel1, pixel28, pixel784].
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |=================================================================| 100%
# Plotting the convergence
plot(model_glrm)

X_matrix <- as.data.table(h2o::h2o.getFrame(model_glrm@model$representation_name))
## Found more than one class "textConnection" in cache; using the first, from namespace 'RJSONIO'
## Also defined by 'BiocGenerics'
## Found more than one class "textConnection" in cache; using the first, from namespace 'RJSONIO'
## Also defined by 'BiocGenerics'
## Found more than one class "textConnection" in cache; using the first, from namespace 'RJSONIO'
## Also defined by 'BiocGenerics'
# Dimension of X_matrix
dim(X_matrix)
## [1] 500   2
# First records of X_matrix
head(X_matrix)
##      Arch1  Arch2
## 1: -0.0247  0.198
## 2:  0.5459  0.459
## 3: -0.0673 -0.597
## 4: -0.5037  0.443
## 5: -0.1740 -0.156
## 6:  0.5486  0.279
# Plot the records in the new two dimensional space
ggplot(as.data.table(X_matrix), aes(x= Arch1, y = Arch2, color = fashion_mnist$label)) + 
    ggtitle("Fashion Mnist GLRM Archetypes") + 
    geom_text(aes(label = fashion_mnist$label)) + 
    theme(legend.position="none")

# Store the label of each record and compute the centroids
X_matrix[, label := as.numeric(fashion_mnist$label)]
X_matrix[, mean_x := mean(Arch1), by = label]
X_matrix[, mean_y := mean(Arch2), by = label]

# Get one record per label and create a vector with class names
X_mean <- unique(X_matrix, by = "label")
label_names <- c("T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", 
                 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"
                 )

# Plot the centroids
ggplot(X_mean, aes(x = mean_x, y = mean_y, color = as.factor(label))) + 
    ggtitle("Fashion Mnist GLRM class centroids") + 
    geom_text(aes(label = label_names[label])) +
    theme(legend.position="none")

makeNA <- function(x) {
    vecNA <- sort(unique(sample(1:length(x), round(0.225*length(x)), replace=TRUE)))
    x[vecNA] <- NA
    return(x)
}

fashion_mnist_miss <- fashion_mnist %>%
    select(-label) %>%
    apply(1, FUN=makeNA)


# Store the input data in h2o
fashion_mnist_miss.hex <- h2o::as.h2o(fashion_mnist_miss, "fashion_mnist_miss.hex")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
# Build a GLRM model
model_glrm <- h2o::h2o.glrm(training_frame = fashion_mnist_miss.hex, transform="NORMALIZE", 
                            k=2, max_iterations=100
                            )
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |=================================================================| 100%
# Impute missing values
fashion_pred <- h2o::h2o.predict(model_glrm, fashion_mnist_miss.hex)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
# Observe the statistics of the first 5 pixels
summary(fashion_pred[, 1:5])
## Warning in summary.H2OFrame(fashion_pred[, 1:5]): Approximated quantiles
## computed! If you are interested in exact quantiles, please pass the
## `exact_quantiles=TRUE` parameter.
##  reconstr_V1          reconstr_V2          reconstr_V3        
##  Min.   :-0.3249504   Min.   :-0.3457294   Min.   :-0.186293  
##  1st Qu.:-0.2372956   1st Qu.:-0.2526312   1st Qu.:-0.136938  
##  Median : 0.0287444   Median : 0.0295671   Median : 0.011318  
##  Mean   : 0.0007149   Mean   : 0.0001298   Mean   :-0.004211  
##  3rd Qu.: 0.2117430   3rd Qu.: 0.2241115   3rd Qu.: 0.113663  
##  Max.   : 0.3585177   Max.   : 0.3801826   Max.   : 0.196304  
##  reconstr_V4         reconstr_V5        
##  Min.   :-0.501134   Min.   :-0.179736  
##  1st Qu.:-0.364382   1st Qu.:-0.130077  
##  Median : 0.050674   Median : 0.022016  
##  Mean   : 0.007238   Mean   : 0.006123  
##  3rd Qu.: 0.336974   3rd Qu.: 0.126495  
##  Max.   : 0.565161   Max.   : 0.209747
# Get the starting timestamp
time_start <- proc.time()

# Train the random forest
rf_model <- randomForest::randomForest(x = fashion_mnist[, -1], y = fashion_mnist$label, ntree = 20)

# Get the end timestamp
time_end <- timetaken(time_start)

# Show the error and the time
rf_model$err.rate[20]
## [1] 0.302
time_end
## [1] "0.460s elapsed (0.450s cpu)"
# Get the starting timestamp
# time_start <- proc.time()

# Train the random forest
# rf_model <- randomForest(x = train_x, y = train_y, ntree = 500)

# Get the end timestamp
# time_end <- timetaken(time_start)

# Show the error and the time
# rf_model$err.rate[500]
# time_end

Optimizing R Code with Rcpp

Chapter 1 - Introduction

Introduction:

  • R is an interpreted language, which can lead to slow run-times
  • C++ is a compiled language, making it much faster at the expense of requiring compiling (harder to learn and write)
  • The Rcpp package simplifies the process of using C++ from R
    • Introduction - basic C++ syntax
    • C++ functions and control flow
    • Vector classes
    • Case studies
  • Can use the microbenchmark package to compare the processing times of various code snippets
    • library(microbenchmark)
    • x <- rnorm(1e6)
    • microbenchmark( slowmax(x), max(x) ) # slowmax was written in a very inefficient manner
  • Since it is a compiled language, C++ does not have interactive console capabilities, though Rcpp builds in some helper functions for this
    • evalCpp( "40 + 2" )
    • evalCpp( "exp(1.0)" )
    • evalCpp( "std::numeric_limits<int>::max()" )
  • Basic number types differ between R and C++
    • Literal numbers are doubles in R, and require the L to cast as integers (32 is a double, 32L is an integer)
    • Literal numbers are integers in C++, and require a trailing .0 to cast as doubles (32 is an integer, 32.0 is a double)
  • Can explicitly cast numbers between double and integer in C++
    • y <- evalCpp( "(double)(40 + 2)" )
    • evalCpp( "13 / 4" ) # Integer division; result is 3
    • evalCpp( "(double)13 / 4" ) # Promoted to floating-point division when either operand is a double; result is 3.25

Inline Functions with cppFunction:

  • Can define C++ functions using Rcpp, either scripted or using the R console
    • library(Rcpp)
    • cppFunction("int fun(){
    • int x = 37 ;
    • return x ;
    • }" )
    • fun()
  • There is a lot of engineering that happens automatically behind the scenes when using Rcpp
  • Variables in C++ are statically typed (in contrast to R, which is dynamically typed), meaning a variable's type cannot change once declared
    • Functions must declare the types of all inputs and outputs, allowing the compiler to optimize code
  • Example functions using Rcpp
    • cppFunction("double add( double x, double y){
    • double res = x + y ;
    • return res ;
    • }")
    • add( 30, 12 )
    • See below for the equivalent R code
    • addr <- function(x, y) {
    • res <- x + y
    • res
    • }

Debugging:

  • Light debugging includes print outs as the loop runs or message printing at key points
  • The Rprintf() function in Rcpp allows for printing to the R console
    • cppFunction( 'int fun(){
    • // Some values
    • int x = 42 ;
    • // Printing a message to the R console
    • Rprintf( "some message in the console, x=%d\n", x ) ;
    • // Return some int
    • return 76 ;
    • }
    • ')
  • Integer placeholders are %d while string placeholders are %s
  • Error handling in C++ allows for checking that key parameters are inside a defined key range
    • cppFunction( 'int fun(int x){
    • // A simple error message
    • if( x < 0 ) stop( "sorry x should be positive" ) ;
    • // A formatted error message
    • if( x > 20 ) stop( "x is too big (x=%d)", x ) ;
    • // Return some int
    • return x ;
    • }')

Example code includes:

# Load microbenchmark
library(microbenchmark)
library(Rcpp)


# Define the function sum_loop
sum_loop <- function(x) {
  result <- 0
  for (i in x) result <- result + i
  result
}

x <- rnorm(100000)

# Check for equality 
all.equal(sum_loop(x), sum(x))

# Compare the performance
microbenchmark(sum_loop = sum_loop(x), R_sum = sum(x))


# Evaluate 2 + 2 in C++
x <- evalCpp("2+2")

# Evaluate 2 + 2 in R
y <- 2+2

# Storage modes of x and y
storage.mode(x)
storage.mode(y)

# Change the C++ expression so that it returns a double
z <- evalCpp("2.0 + 2")


# Evaluate 17 / 2 in C++
evalCpp("17/2")

# Cast 17 to a double and divide by 2
evalCpp("(double)17/2")

# Cast 56.3 to an int
evalCpp("(int)56.3")


# Define the function the_answer()
cppFunction('
  int the_answer() {
    return 42 ;
  }
')

# Check the_answer() returns the integer 42
the_answer() == 42L


# Define the function euclidean_distance()
cppFunction('
  double euclidean_distance(double x, double y) {
    return sqrt(x*x + y*y) ;
  }
')

# Calculate the euclidean distance
euclidean_distance(1.5, 2.5)


# Define the function add()
cppFunction('
  int add(int x, int y) {
    int res = x + y ;
    Rprintf("** %d + %d = %d\\n", x, y, res) ;
    return res ;
  }
')

# Call add() to print THE answer
add(40, 2)


cppFunction('
  // adds x and y, but only if they are positive
  int add_positive_numbers(int x, int y) {
      // if x is negative, stop
      if( x < 0 ) stop("x is negative") ;
    
      // if y is negative, stop
      if( y < 0 ) stop("y is negative") ;
     
      return x + y ;
  }
')

# Call the function with positive numbers
add_positive_numbers(2, 3)

# Call the function with a negative number
add_positive_numbers(-2, 3)

Chapter 2 - Functions and Control Flow

C++ Functions Belong to C++ Files:

  • Can use .cpp files to save and source C++ functions
    • sourceCpp( "code.cpp" )
    • timesTwo( 21 )
  • To write a .cpp file, the following structure should be included
    • #include <Rcpp.h>

    • using namespace Rcpp ;
    • // [[Rcpp::export]]
    • int timesTwo( int x ){
    • return 2*x ;
    • }
  • Note that the return statement requires a semicolon afterwards
  • Note that // [[Rcpp::export]] is a comment to C++ but is picked up as meaningful by Rcpp (it marks the following function for export to R)

Writing Functions in C++:

  • Only functions exported via // [[Rcpp::export]] are available to R; the others are internal (private) to the C++ file
  • Can run single-line comments using // and multi-line comments using /* */
  • Rcpp includes a special comment block that embeds R code in the C++ file
    • /*** R
    • # insert R code here
    • */
  • The if-else syntax for Rcpp is very similar to the base R syntax
    • if( condition ){
    • // code if true
    • } else {
    • // code otherwise
    • }
  • Can also have a void function that is called only for side effects such as printing
    • // [[Rcpp::export]]
    • void info( double x){
    • if( x < 0 ){
    •   Rprintf( "x is negative" ) ;  
    • } else if( x == 0 ){
    •   Rprintf( "x is zero" ) ;  
    • } else if( x > 0 ){
    •   Rprintf( "x is positive" ) ;  
    • } else {
    •   Rprintf( "x is not a number" ) ;  
    • }
    • }

For Loops:

  • There are four components to a C++ for loop, and the process has meaningful differences from R for loops
    • Initialization - happens once, at the very beginning of the for loop
    • Continue condition - Logical condition to control if the loop continues
    • Increment - Executed at the end of each iteration (often to add 1 to an index)
    • Body - statements to be executed at each iteration
  • Example of a very typical for loop in C++ (note that i++ is shorthand for "increment i by 1")
    • for (int i=0; i<n; i++ ){
    • // some code using i
    • }
  • Example function to calculate the sum of the first n integers
    • // [[Rcpp::export]]
    • int nfirst( int n ){
    • if( n < 0 ) {
    •   stop( "n must be positive, I see n=%d", n ) ;
    • }
    • int result = 0 ;
    • for( int i=0; i<n; i++){
    •   if( i == 13 ){  
    •       Rprintf( "I cannot handle that, I am superstitious" ) ;  
    •       break ;  
    •   }  
    •   result = result + (i+1) ;  
    • }
    • return result ;
    • }
  • Example of the iterative approach from Newton to calculating square roots (see example code below)

While Loops:

  • The while loop is conceptually simpler: there is only a continue condition and a body
  • The for loop from above can be conceptually re-written as a while loop
    • init
    • while( condition ){
    • body
    • increment
    • }
  • Can also run the do-while loop, where the condition comes after the body (particularly useful when the body must be run at least once)
    • do {
    • body
    • } while( condition ) ;

Example code includes:

# file should be saved as 'script.cpp'
# file called as sourceCpp('script.cpp')


#include <Rcpp.h>
using namespace Rcpp ; 

// Export the function to R
//[[Rcpp::export]]
double twice(double x) {
    // Fix the syntax error
    return x+x;
}


// Include the Rcpp.h header
#include <Rcpp.h>

// Use the Rcpp namespace
using namespace Rcpp;

// [[Rcpp::export]]
int the_answer() {
    // Return 42
    return 42;
}

/*** R
# Call the_answer() to check you get the right result
the_answer()
*/


#include <Rcpp.h>
using namespace Rcpp; 

// Make square() accept and return a double
double square(double x) {
  // Return x times x
  return x*x ;
}

// [[Rcpp::export]]
double dist(double x, double y) {
  // Change this to use square()
  return sqrt(square(x) + square(y));
}


#include <Rcpp.h>
using namespace Rcpp; 

double square(double x) {
  return x * x ;
}

// [[Rcpp::export]]
double dist(double x, double y) {
  return sqrt(square(x) + square(y));
}

// Start the Rcpp R comment block
/*** R
# Call dist() to the point (3, 4)
dist(3, 4)
# Close the Rcpp R comment block
*/


#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
double absolute(double x) {
  // Test for x greater than zero
  if(x > 0) {
    // Return x
    return x; 
  // Otherwise
  } else {
    // Return negative x
    return -x;
  }
}

/*** R  
absolute(pi)
absolute(-3)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sqrt_approx(double value, int n) {
    // Initialize x to be one
    double x = 1;
    // Specify the for loop
    for(int i = 0; i < n; i++) {
        x = (x + value / x) / 2.0;
    }
    return x;
}

/*** R
sqrt_approx(2, 10)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List sqrt_approx(double value, int n, double threshold) {
    double x = 1.0;
    double previous = x;
    bool is_good_enough = false;
    int i;
    for(i = 0; i < n; i++) {
        previous = x;
        x = (x + value / x) / 2.0;
        is_good_enough = fabs(previous - x) < threshold;
        
        // If the solution is good enough, then "break"
        if(is_good_enough) break;
    }
    return List::create(_["i"] = i , _["x"] = x);
}

/*** R
sqrt_approx(2, 1000, 0.1)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sqrt_approx(double value, double threshold) {
    double x = 1.0;
    double previous = x;
    bool is_good_enough = false;
    
    // Specify the while loop
    while(is_good_enough == false) {
        previous = x;
        x = (x + value / x) / 2.0;
        is_good_enough = fabs(x - previous) < threshold;
    }
    
    return x ;
}

/*** R
sqrt_approx(2, 0.00001)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sqrt_approx(double value, double threshold) {
    double x = 1.0;
    double previous = x;
    bool is_good_enough = false;
    
    // Initiate do while loop
    do {
        previous = x;
        x = (x + value / x) / 2.0;
        is_good_enough = fabs(x - previous) < threshold;
    // Specify while condition
    } while (is_good_enough == false);
    
    return x;
}

/*** R
sqrt_approx(2, 0.00001)
*/

Chapter 3 - Vector Classes

Rcpp Classes and Vectors:

  • Rcpp defines a number of C++ classes that greatly simplify processing and interfacing between C++ and R
    • NumericVector to manipulate numeric vectors, e.g. c(1,2,3)
    • IntegerVector for integer e.g. 1:3
    • LogicalVector for logical e.g. c(TRUE, FALSE)
    • CharacterVector for strings e.g. c("a", "b", "c")
    • List for lists, aka vectors of arbitrary R objects
  • There is an API for vectors, which allows for running some key methods
    • x.size() gives the number of elements of the vector x
    • x[i] gives the element on the ith position in the vector x
    • Indexing in C++ starts at 0; the index is an offset from the first position
  • Example pseudo-code for looping around a vector
    • // x comes from somewhere
    • NumericVector x = … ;
    • int n = x.size() ;
    • for( int i=0; i<n; i++){
    • // manipulate x[i]
    • }

Creating Vectors:

  • Functions using Rcpp classes are meant to be called from R
  • Because C++ is statically typed, Rcpp coerces R objects to the declared type before passing the data to C++
  • Example for creating a vector of a given size in Rcpp (numeric vectors are initialized to 0 and string vectors are initialized to blank strings)
    • // [[Rcpp::export]]
    • NumericVector ones(int n){
    • // create a new numeric vector of size n
    • NumericVector x(n) ;
    • // manipulate it
    • for( int i=0; i<n; i++){
    •   x[i] = 1 ;  
    • }
    • return x ;
    • }
  • Can override the default initialization values for a new vector by passing an additional argument
    • double value = 42.0 ;
    • int n = 20 ;
    • // create a numeric vector of size 20
    • // with all values set to 42
    • NumericVector x( n, value ) ;
  • Can use the static create() method to initialize a vector with specific (optionally named) values
    • NumericVector x = NumericVector::create( 1, 2, 3 ) ;
    • CharacterVector s = CharacterVector::create( "pink", "blue" ) ;
    • NumericVector x = NumericVector::create( _["a"] = 1, _["b"] = 2, _["c"] = 3 ) ;
    • IntegerVector y = IntegerVector::create( _["d"] = 4, 5, 6, _["f"] = 7 ) ;
  • Can also clone vectors to avoid changing the original vector (creates a “deep copy”)
    • // [[Rcpp::export]]
    • NumericVector positives( NumericVector x ){
    • // clone x into y
    • NumericVector y = clone(x) ;
    • for( int i=0; i< y.size(); i++){
    •   if( y[i] < 0 ) y[i] = 0 ;  
    • }
    • return y ;
    • }

Weighted Mean:

  • Example task for using Rcpp: computing a weighted mean
    • weighted_mean_R <- function(x, w){ sum(x*w) / sum(w) } # R code version
  • Example of very inefficient R code
    • weighted_mean_loop <- function(x, w){
    • total_xw <- 0
    • total_w <- 0
    • for( i in seq_along(x)){
    •   total_xw <- total_xw + x[i]*w[i]  
    •   total_w  <- total_w  + w[i]  
    • }
    • total_xw / total_w
    • }
  • Translation to C++ code using Rcpp
    • // [[Rcpp::export]]
    • double weighted_mean_cpp( NumericVector x, NumericVector w){
    • double total_xw = 0.0 ;
    • double total_w = 0.0 ;
    • int n = ___ ;
    • for( ___ ; ___ ; ___ ){
    •   // accumulate into total_xw and total_w  
    • }
    • return total_xw / total_w ;
    • }
  • Missing values need to be tested based on special functions (similar to R)
    • Each type of vector has its own special missing values
    • bool test = NumericVector::is_na(x) ;
    • double y = NumericVector::get_na() ; // The representation of NA in double

Vectors From the STL:

  • Rcpp vectors are thin wrappers around R vectors
    • Cannot (cost effectively) change size: data copy every time
  • STL (Standard Template Library) vectors are independent of R vectors
    • Cheap to grow and shrink: amortized copies
  • Generally, growing vectors is very expensive and should be avoided (in R or C++)
    • Can even be more efficient to have two functions; first finds the vector size, second creates vector of that size and then fills it

Example code includes:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double first_plus_last(NumericVector x) {
    // The size of x
    int n = x.size();
    // The first element of x
    double first = x[0];
    // The last element of x
    double last = x[n-1];
    return first + last;
}

/*** R
x <- c(6, 28, 496, 8128)
first_plus_last(x)
# Does the function give the same answer as R?
all.equal(first_plus_last(x), x[1] + x[4])
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sum_cpp(NumericVector x) {
  // The size of x
  int n = x.size();
  // Initialize the result
  double result = 0;
  // Complete the loop specification
  for(int i = 0; i<n; i++) {
    // Add the next value
    result = result + x[i];
  }
  return result;
}

/*** R
set.seed(42)
x <- rnorm(1e6)
sum_cpp(x)
# Does the function give the same answer as R's sum() function?
all.equal(sum_cpp(x), sum(x))
*/


#include <Rcpp.h>
using namespace Rcpp;

// Set the return type to IntegerVector
// [[Rcpp::export]]
IntegerVector seq_cpp(int lo, int hi) {
  int n = hi - lo + 1;
    
  // Create a new integer vector, sequence, of size n
  IntegerVector sequence(n);
    
  for(int i = 0; i < n; i++) {
    // Set the ith element of sequence to lo plus i
    sequence[i] = lo + i;
  }
  
  return sequence;
}

/*** R
lo <- -2
hi <- 5
seq_cpp(lo, hi)
# Does it give the same answer as R's seq() function?
all.equal(seq_cpp(lo, hi), seq(lo, hi))
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List create_vectors() {
  // Create an unnamed character vector
  CharacterVector polygons = CharacterVector::create("triangle", "square", "pentagon");
  // Create a named integer vector
  IntegerVector mersenne_primes = IntegerVector::create(_["first"] = 3, _["second"] = 7, _["third"] = 31);
  // Create a named list
  List both = List::create(_["polygons"] = polygons, _["mersenne_primes"] = mersenne_primes);
  return both;
}

/*** R
create_vectors()
*/


# Unlike base R, Rcpp vectors have reference semantics: copying a variable copies only a reference to the underlying data, so changes made through the copy also show up in the original.
# To prevent this behavior, use the clone() function to copy the underlying data from the original variable into a new variable.
# The syntax is y = clone(x). In this exercise, we have defined two functions for you:
# change_negatives_to_zero(): Takes a numeric vector, modifies it by replacing negative numbers with zero, then returns both the original vector and the copy.
# change_negatives_to_zero_with_cloning(): Does the same thing as above, but clones the original vector before modifying it.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List change_negatives_to_zero(NumericVector the_original) {
  // Set the copy to the original
  NumericVector the_copy = the_original;
  int n = the_original.size();
  for(int i = 0; i < n; i++) {
    if(the_copy[i] < 0) the_copy[i] = 0;
  }
  return List::create(_["the_original"] = the_original, _["the_copy"] = the_copy);
}

// [[Rcpp::export]]
List change_negatives_to_zero_with_cloning(NumericVector the_original) {
  // Clone the original to make the copy
  NumericVector the_copy = clone(the_original);
  int n = the_original.size();
  for(int i = 0; i < n; i++) {
    if(the_copy[i] < 0) the_copy[i] = 0;
  }
  return List::create(_["the_original"] = the_original, _["the_copy"] = the_copy);
}

/*** R
x <- c(0, -4, 1, -2, 2, 4, -3, -1, 3)
change_negatives_to_zero_with_cloning(x)
change_negatives_to_zero(x)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double weighted_mean_cpp(NumericVector x, NumericVector w) {
  // Initialize these to zero
  double total_w = 0.0;
  double total_xw = 0.0;
  
  // Set n to the size of x
  int n = x.size();
  
  // Specify the for loop arguments
  for(int i = 0; i<n; i++) {
    // Add ith weight
    total_w += w[i];
    // Add the ith data value times the ith weight
    total_xw += w[i]*x[i];
  }
  
  // Return the total product divided by the total weight
  return total_xw / total_w;
}

/*** R 
x <- c(0, 1, 3, 6, 2, 7, 13, 20, 12, 21, 11)
w <- 1 / seq_along(x)
weighted_mean_cpp(x, w)
# Does the function give the same results as R's weighted.mean() function?
all.equal(weighted_mean_cpp(x, w), weighted.mean(x, w))
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double weighted_mean_cpp(NumericVector x, NumericVector w) {
  double total_w = 0;
  double total_xw = 0;
  
  int n = x.size();
  
  for(int i = 0; i < n; i++) {
    // If the ith element of x or w is NA then return NA
    if (NumericVector::is_na(x[i]) || NumericVector::is_na(w[i])) return NumericVector::get_na();
    total_w += w[i];
    total_xw += x[i] * w[i];
  }
  
  return total_xw / total_w;
}

/*** R 
x <- c(0, 1, 3, 6, 2, 7, 13, NA, 12, 21, 11)
w <- 1 / seq_along(x)
weighted_mean_cpp(x, w)
*/


NumericVector bad_select_positive_values_cpp(NumericVector x) {
  NumericVector positive_x(0);
  for(int i = 0; i < x.size(); i++) {
    if(x[i] > 0) {
      positive_x.push_back(x[i]);
    }
  }
  return positive_x;
}

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector good_select_positive_values_cpp(NumericVector x) {
  int n_elements = x.size();
  int n_positive_elements = 0;
  
  // Calculate the size of the output
  for(int i = 0; i < n_elements; i++) {
    // If the ith element of x is positive
    if(x[i] > 0) {
      // Add 1 to n_positive_elements
      n_positive_elements++;
    }
  }
  
  // Allocate a vector of size n_positive_elements
  NumericVector positive_x(n_positive_elements);
  
  // Fill the vector
  int j = 0;
  for(int i = 0; i < n_elements; i++) {
    // If the ith element of x is positive
    if(x[i] > 0) {
      // Set the jth element of positive_x to the ith element of x
      positive_x[j] = x[i];
      // Add 1 to j
      j++;
    }
  }
  return positive_x;
}

/*** R
set.seed(42)
x <- rnorm(1e4)
# Does it give the same answer as R?
all.equal(good_select_positive_values_cpp(x), x[x > 0])
# Which is faster?
microbenchmark(
  bad_cpp = bad_select_positive_values_cpp(x),
  good_cpp = good_select_positive_values_cpp(x)
)
*/


# The Standard Template Library (STL) is a C++ library containing flexible algorithms and data structures
# For example, the double vector from the STL is like a "native C++" equivalent of Rcpp's NumericVector
# The following code creates a standard double vector named x with ten elements
std::vector<double> x(10);
# Usually it makes more sense to stick with Rcpp vector types because they give you access to many convenience methods that work like their R equivalents, including mean(), round(), and abs()
# However, STL vectors have the advantage that they can dynamically change size without paying for a data copy every time


#include <Rcpp.h>
using namespace Rcpp;

// Set the return type to a standard double vector
// [[Rcpp::export]]
std::vector<double> select_positive_values_std(NumericVector x) {
  int n = x.size();
  
  // Create positive_x, a standard double vector
  std::vector<double> positive_x(0);
    
  for(int i = 0; i < n; i++) {
    if(x[i] > 0) {
      // Append the ith element of x to positive_x
      positive_x.push_back(x[i]);
    }
  }
  return positive_x;
}

/*** R
set.seed(42)
x <- rnorm(1e6)
# Does it give the same answer as R?
all.equal(select_positive_values_std(x), x[x > 0])
# Which is faster?
microbenchmark(
  good_cpp = good_select_positive_values_cpp(x),
  std = select_positive_values_std(x)
)
*/

Chapter 4 - Case Studies

Random Number Generation:

  • Can use functions from the R namespace for features like drawing random numbers from a distribution
    • // one number from a N(0,1) - function only gives a single number
    • double x = R::rnorm( 0, 1 ) ;
  • Can also use the functions of the same name available in the Rcpp namespace - Rcpp::rnorm() and the like
  • Example for creating numbers from a truncated distribution (e.g., a normal with only positive values) - rejection sampling
    • // we generate n numbers
    • NumericVector x(n) ;
    • // fill the vector in a loop
    • for( int i=0; i<n; i++){
    • // keep generating d until it gets positive
    • double d ;
    • do {
    •   d = ... ;  
    • } while( d < 0 ) ;
    • x[i] = d ;
    • }
  • Example for creating numbers from a mixture of distributions
    • int component( NumericVector weights, double total_weight ){
    • // return the index of the selected component
    • }
    • NumericVector rmix( int n, NumericVector weights, NumericVector means, NumericVector sds ){
    • NumericVector res(n) ;
    • for( int i=0; i<n; i++){
    •   // find which component to use  
    •   ...  
    •   // simulate using the mean and sd from the selected component  
    •   ...  
    • }
    • return res ;
    • }

Rolling Operations:

  • Rolling means are vectors where the value at position j is the mean of a window of n consecutive elements
    • The computation runs much faster if, at each step, the element leaving the window is subtracted from the running total and the new element is added, rather than re-summing the whole window
  • Using C++ code can help make processes like these easy to read while maintaining very strong performance
  • Example of imputing missing values with the running mean of prior observations (mean carried forward); last observation carried forward (LOCF) is a closely related approach
    • iterative version (can be converted to C++ code)

    • na_meancf2 <- function(x){
    • total <- 0
    • n <- 0
    • for( i in seq_along(x) ){
    •   if( is.na(x[i]) ){  
    •       x[i] <- total / n  
    •   } else {  
    •       total <- x[i] + total  
    •       n <- n + 1  
    •   }  
    • }
    • x
    • }

Auto-Regressive Model:

  • Example for AR code in R (works well, but is inefficient)
    • ar <- function(n, phi, sd){
    • x <- epsilon <- rnorm(n, sd = sd)
    • np <- length(phi)
    • for( i in seq(np+1, n)){
    •   x[i] <- sum(x[seq(i-1, i-np)] * phi) + epsilon[i]  
    • }
    • x
    • }
  • Example skeleton for writing the same code using C++
    • NumericVector x(n) ;
    • // initial loop
    • for( ___ ; __ < np ; ___ ){
    • x[i] = R::rnorm(___) ;
    • }
    • // outer loop
    • for( ___ ; ___ ; ___ ){
    • double value = rnorm(___) ;
    • // inner loop
    • for( ___ ; ___ ; ___ ){
    •   value += ___ ;  
    • }
    • x[i] = value ;
    • }
  • Similar example for the R code for an MA (moving average)
    • ma <- function(n, theta, sd){
    • epsilon <- rnorm(n, sd = sd)
    • x <- numeric(n)
    • nq <- length(theta)
    • for( i in seq(nq+1, n)){
    •   x[i] <- sum(epsilon[seq(i-1, i-nq)] * theta) + epsilon[i]  
    • }
    • x
    • }
  • Example of running the MA (skeleton) using C++
    • #include <Rcpp.h>

    • using namespace Rcpp ;
    • // [[Rcpp::export]]
    • NumericVector ma( int n, double mu, NumericVector theta, double sd ){
    • int nq = theta.size() ;
    • // generate the noise vector at once
    • // using the Rcpp::rnorm function, similar to the R function
    • NumericVector eps = Rcpp::rnorm(n, 0.0, sd) ;
    • // init the output vector of size n with all 0.0
    • NumericVector x(___) ;
    • // start filling the values at index nq + 1
    • for( int i=nq+1; i<n; i++){
    •   ____  
    • }
    • return x ;
    • }
  • Can also combine AR-MA in to an ARMA model

Wrap Up:

  • The Rcpp package combines the ease of R syntax with the speed and efficiency of C++
  • Vectorized code is basically a loop in a compiled language such as C++
  • Recap of key topics from this course - for loops in C++ are an extremely common use case
    • evalCpp and cppFunction
    • for loops in C++ - for( init ; condition ; increment ){ body }
    • Vectors, including that vector indexing starts at 0
  • C++ files with Rcpp
    • #include <Rcpp.h>

    • using namespace Rcpp ;
    • // [[Rcpp::export]]
    • double add( double x, double y){
    • return x + y ;
    • }
    • // [[Rcpp::export]]
    • double twice( double x){
    • return 2.0 * x;
    • }

Example code includes:

# When you write R code, it usually makes sense to generate random numbers in a vectorized fashion
# When you are in C++ however, you are allowed (even by your guilty conscience) to use loops and process the data element by element

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector positive_rnorm(int n, double mean, double sd) {
    // Specify out as a numeric vector of size n
    NumericVector out(n);
    // This loops over the elements of out
    for(int i = 0; i < n; i++) {
        // This loop keeps trying to generate a value
        do {
            // Call R's rnorm()
            out[i] = R::rnorm(mean, sd);
            // While the number is negative, keep trying
        } while(out[i] < 0);
    }
    return out;
}

/*** R
positive_rnorm(10, 2, 2)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int choose_component(NumericVector weights, double total_weight) {
    // Generate a uniform random number from 0 to total_weight
    double x = R::runif(0, total_weight);
    // Remove the jth weight from x until x is small enough
    int j = 0;
    while(x >= weights[j]) {
        // Subtract jth element of weights from x
        x -= weights[j];
        j++;
    }
    return j;
}

/*** R
weights <- c(0.3, 0.7)
# Randomly choose a component 5 times
replicate(5, choose_component(weights, sum(weights)))
*/


#include <Rcpp.h>
using namespace Rcpp;

// From previous exercise; do not modify
// [[Rcpp::export]]
int choose_component(NumericVector weights, double total_weight) {
    double x = R::runif(0, total_weight);
    int j = 0;
    while(x >= weights[j]) {
        x -= weights[j];
        j++;
    }
    return j;
}

// [[Rcpp::export]]
NumericVector rmix(int n, NumericVector weights, NumericVector means, NumericVector sds) {
    // Check that weights and means have the same size
    int d = weights.size();
    if(means.size() != d) {
        stop("means size != weights size");
    }
    // Do the same for the weights and std devs
    if(sds.size() != d) {
        stop("sds size != weights size");
    }
    // Calculate the total weight
    double total_weight = 0.0;
    for (int i=0; i<d; i++) { 
        total_weight += weights[i];
    };
    // Create the output vector
    NumericVector res(n);
    // Fill the vector
    for(int i = 0; i < n; i++) {
        // Choose a component
        int j = choose_component(weights, total_weight);
        // Simulate from the chosen component
        res[i] = R::rnorm(means[j], sds[j]);
    }
    return res;
}

/*** R
weights <- c(0.3, 0.7)
means <- c(2, 4)
sds <- c(2, 4)
rmix(10, weights, means, sds)
*/


# Complete the definition of rollmean3()
rollmean3 <- function(x, window = 3) {
    # Add the first window elements of x
    initial_total <- sum(head(x, window))
    # The elements to add at each iteration
    lasts <- tail(x, - window)
    # The elements to remove
    firsts <- head(x, - window)
    # Take the initial total and add the 
    # cumulative sum of lasts minus firsts
    other_totals <- initial_total + cumsum(lasts - firsts)
    # Build the output vector
    c(rep(NA, window - 1), # leading NA
      initial_total / window, # initial mean
      other_totals / window   # other means
      )
}

# From previous step; do not modify
rollmean3 <- function(x, window = 3) {
    initial_total <- sum(head(x, window))   
    lasts <- tail(x, - window)
    firsts <- head(x, - window)
    other_totals <- initial_total + cumsum(lasts - firsts)
    c(rep(NA, window - 1), initial_total / window, other_totals / window)
}

# This checks rollmean1() and rollmean2() give the same result
all.equal(rollmean1(x), rollmean2(x))

# This checks rollmean1() and rollmean3() give the same result
all.equal(rollmean1(x), rollmean3(x))

# Benchmark the performance
microbenchmark(rollmean1(x), rollmean2(x), rollmean3(x), times = 5)


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rollmean4(NumericVector x, int window) {
    int n = x.size();
    // Set res as a NumericVector of NAs with length n
    NumericVector res(n, NumericVector::get_na());
    // Sum the first window worth of values of x
    double total = 0.0;
    for(int i = 0; i < window; i++) {
        total += x[i];
    }
    // Treat the first case separately
    res[window - 1] = total / window;
    // Iteratively update the total and recalculate the mean 
    for(int i = window; i < n; i++) {
        // Remove the (i - window)th case, and add the ith case
        total += - x[i-window] + x[i];
        // Calculate the mean at the ith position
        res[i] = total / window;
    }
    return res;  
}

/*** R
# Compare rollmean2, rollmean3 and rollmean4   
set.seed(42)
x <- rnorm(1e4)
microbenchmark(rollmean2(x, 4), rollmean3(x, 4), rollmean4(x, 4), times = 5)   
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector na_locf2(NumericVector x) {
    // Initialize to NA
    double current = NumericVector::get_na();
    int n = x.size();
    NumericVector res = no_init(n);
    for(int i = 0; i < n; i++) {
        // If ith value of x is NA
        if(NumericVector::is_na(x[i])) {
            // Set ith result as current
            res[i] = current;
        } else {
            // Set current as ith value of x
            current = x[i];
            res[i] = x[i];
        }
    }
    return res ;
}

/*** R
library(microbenchmark)
set.seed(42)
x <- rnorm(1e5)
# Sprinkle some NA into x
x[sample(1e5, 100)] <- NA  
microbenchmark(na_locf1(x), na_locf2(x), times = 5)
*/


#include <Rcpp.h>
using namespace Rcpp; 

// [[Rcpp::export]]
NumericVector na_meancf2(NumericVector x) {
    double total_not_na = 0.0;
    double n_not_na = 0.0;
    NumericVector res = clone(x);
    int n = x.size();
    for(int i = 0; i < n; i++) {
        // If ith value of x is NA
        if(NumericVector::is_na(x[i])) {
            // Set the ith result to the total of non-missing values 
            // divided by the number of non-missing values
            res[i] = total_not_na / n_not_na;
        } else {
            // Add the ith value of x to the total of non-missing values
            total_not_na += x[i];
            // Add 1 to the number of non-missing values
            n_not_na ++;
        }
    }
    return res;
}

/*** R
library(microbenchmark)
set.seed(42)
x <- rnorm(1e5)
x[sample(1e5, 100)] <- NA  
microbenchmark(na_meancf1(x), na_meancf2(x), times = 5)
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector ar2(int n, double c, NumericVector phi, double eps) {
    int p = phi.size();
    NumericVector x(n);
    // Loop from p to n
    for(int i = p; i < n; i++) {
        // Generate a random number from the normal distribution
        double value = R::rnorm(c, eps);
        // Loop from zero to p
        for(int j = 0; j < p; j++) {
            // Increase by the jth element of phi times 
            // the "i minus j minus 1"th element of x
            value += phi[j] * x[i-j-1];
        }
        x[i] = value;
    }
    return x;
}

/*** R
d <- data.frame(x = 1:50, y = ar2(50, 10, c(1, -0.5), 1))
ggplot(d, aes(x, y)) + 
    geom_line()
*/


#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
NumericVector ma2( int n, double mu, NumericVector theta, double sd ){
    int q = theta.size(); 
    NumericVector x(n);
    // Generate the noise vector
    NumericVector eps = rnorm(n, 0.0, sd);
    // Loop from q to n
    for(int i = q; i < n; i++) {
        // Value is mean plus noise
        double value = mu + eps[i];
        // Loop from zero to q
        for(int j = 0; j < q; j++) {
            // Increase by the jth element of theta times
            // the "i minus j minus 1"th element of eps
            value += theta[j] * eps[i - j - 1];
        }
        // Set ith element of x to value
        x[i] = value;
    }
    return x ;
}

/*** R
d <- data.frame(x = 1:50, y = ma2(50, 10, c(1, -0.5), 1))
ggplot(d, aes(x, y)) + 
    geom_line()
*/


#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector arma(int n, double mu, NumericVector phi, NumericVector theta, double sd) {
    int p = phi.size();
    int q = theta.size();
    NumericVector x(n);
    // Generate the noise vector
    NumericVector eps = rnorm(n, 0.0, sd);
    // Start at the max of p and q plus 1
    int start = std::max(p, q) + 1;
    // Loop i from start to n
    for(int i = start; i < n; i++) {
        // Value is mean plus noise
        double value = mu + eps[i];
        // The MA(q) part
        for(int j = 0; j < q; j++) {
            // Increase by the jth element of theta times
            // the "i minus j minus 1"th element of eps
            value += theta[j] * eps[i - j - 1];
        }
        // The AR(p) part
        for(int j = 0; j < p; j++) {
            // Increase by the jth element of phi times
            // the "i minus j minus 1"th element of x
            value += phi[j] * x[i - j - 1];
        }
        x[i] = value;
    }
    return x;
}

/*** R
d <- data.frame(x = 1:50, y = arma(50, 10, c(1, -0.5), c(1, -0.5), 1))
ggplot(d, aes(x, y)) + 
    geom_line()
*/

Regression Modeling in R: Case Studies

Chapter 1 - GLMs

Before Starting:

  • Course focuses on using GLM and Mixed Effects Models in case studies
  • First dataset is data(dragonflies)
    • head(dragonflies)
  • How does stream velocity influence the number of dragonflies present?
    • Response Variable: What are we trying to explain? (y-axis) - abundance
    • Predictor: What do we think is going to influence our response variable? (x-axis) - stream_flow
  • Can use ggplot2 to investigate the data prior to starting modeling
    • ggplot(dragonflies) + geom_histogram(aes(x = abundance))
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance))
  • Overview of key steps prior to starting modeling
    • What are the data?
    • What is the research question?
    • What are the variables of interest?
    • What do the raw data look like?

Introduction to Generalized Linear Models (GLM):

  • The GLM is an ordinary linear model in its simplest form, but can be more complex when needed based on the data, distributions, and research questions
    • glm(response ~ predictor, data, family = "gaussian")
    • gaussian_glm <- glm(abundance ~ stream_flow, data = dragonflies, family = "gaussian")
  • Can generate predicted values and visually assess fits and predictions
    • pred_df <- data.frame(stream_flow = seq(from = 1, to = 5, length = 10))
    • pred_df$predicted <- predict(gaussian_glm, pred_df)
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance)) + geom_line(aes(x = stream_flow, y = predicted), data = pred_df)
  • Important to check residuals - heterogeneity
    • diag <- data.frame(residuals = resid(gaussian_glm), fitted = fitted(gaussian_glm))
    • ggplot(diag) + geom_point(aes(x = fitted, y = residuals))

Poisson GLM:

  • Linear models with Gaussian errors can predict negative values (inappropriate for counts) and show heterogeneity of variance (violating the constant-variance assumption)
  • The Poisson GLM is a flexible extension to the GLM that can be especially valuable for count data
    • Apply a Poisson distribution to the error structure of the model
    • Applicable to count data
    • Use the link function log()
  • Example code for running the Poisson GLM
    • poisson_glm <- glm(abundance ~ stream_flow, data = dragonflies, family = "poisson")
    • pred_df <- data.frame(stream_flow = seq(from = 1, to = 5, length = 10))
    • pred_df$predicted <- predict(poisson_glm, pred_df, type = "response")
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance)) + geom_line(aes(x = stream_flow, y = predicted), data = pred_df)
    • diag <- data.frame(residuals = resid(poisson_glm), fitted = fitted(poisson_glm))
    • ggplot(diag) + geom_point(aes(x = fitted, y = residuals))
  • Can check for over-dispersion with the Poisson GLM model
    • The ratio of residual deviance to residual degrees of freedom should be close to 1
    • Overdispersion: if this ratio is > 1, there is more variability in the data than the model and its error structure can explain
    • dispersion(poisson_glm, modeltype = "poisson")

Example code includes:

dragonflies <- readr::read_csv("./RInputFiles/data1.csv")
## Parsed with column specification:
## cols(
##   abundance = col_double(),
##   feeding_events = col_double(),
##   area = col_double(),
##   stream_flow = col_double(),
##   time = col_character(),
##   season = col_character()
## )
orchids <- readr::read_csv("./RInputFiles/lme_data.csv")
## Parsed with column specification:
## cols(
##   site = col_character(),
##   abundance = col_double(),
##   richness = col_double(),
##   humidity = col_double(),
##   tree_age = col_double()
## )
str(dragonflies)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 150 obs. of  6 variables:
##  $ abundance     : num  16 32 88 140 62 143 121 69 100 68 ...
##  $ feeding_events: num  69 153 408 691 355 678 617 334 534 362 ...
##  $ area          : num  3.67 4.57 5.1 3.19 3.83 ...
##  $ stream_flow   : num  1.288 1.279 0.596 1.5 1.165 ...
##  $ time          : chr  "day" "night" "day" "day" ...
##  $ season        : chr  "summer" "autumn" "summer" "summer" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   abundance = col_double(),
##   ..   feeding_events = col_double(),
##   ..   area = col_double(),
##   ..   stream_flow = col_double(),
##   ..   time = col_character(),
##   ..   season = col_character()
##   .. )
str(orchids)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 160 obs. of  5 variables:
##  $ site     : chr  "a" "a" "a" "a" ...
##  $ abundance: num  11 10 13 11 10 10 9 12 11 10 ...
##  $ richness : num  7 4 4 4 4 3 4 3 3 3 ...
##  $ humidity : num  59.5 70.4 73.4 53.8 66.8 57.9 78.7 81 60.1 73.3 ...
##  $ tree_age : num  14 12 9 14 9 11 13 15 7 19 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   site = col_character(),
##   ..   abundance = col_double(),
##   ..   richness = col_double(),
##   ..   humidity = col_double(),
##   ..   tree_age = col_double()
##   .. )
# Draw histogram
ggplot(dragonflies) +
    geom_histogram(aes(x = feeding_events))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Draw scatterplot
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events))

# Apply a GLM
gaussian_glm <- glm(feeding_events ~ stream_flow, data = dragonflies, family = "gaussian")


# Set up a data frame for predictions
pred_df <- data.frame(stream_flow = seq(from = 1, to = 5, length = 10))

# Generate predictions
pred_df$predicted <- predict(gaussian_glm, pred_df)

# Look at the data frame
pred_df
##    stream_flow predicted
## 1         1.00    321.08
## 2         1.44    285.11
## 3         1.89    249.13
## 4         2.33    213.15
## 5         2.78    177.17
## 6         3.22    141.20
## 7         3.67    105.22
## 8         4.11     69.24
## 9         4.56     33.27
## 10        5.00     -2.71
# Add model line to plot
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events)) +
    geom_line(aes(x = stream_flow, y = predicted), data = pred_df)

# Generate data frame of residuals and fitted values
diag <- data.frame(residuals = resid(gaussian_glm), fitted = fitted(gaussian_glm))

# Visualize residuals vs fitted values
ggplot(diag) +
    geom_point(aes(x = fitted, y = residuals))

# Apply Poisson GLM
poisson_glm <- glm(feeding_events ~ stream_flow, data = dragonflies, family = "poisson")

# Set up a data frame for predictions
pred_df <- data.frame(stream_flow = seq(from = 1, to = 5, length = 10))

# Generate predictions
pred_df$predicted <- predict(poisson_glm, pred_df, type = "response")

# Add line representing Poisson GLM
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events)) +
    geom_line(aes(x = stream_flow, y = predicted), data = pred_df)

dispersion <- function(model, modeltype = "poisson"){
    A <- sum(resid(model, type = "pearson") ^ 2)
    if (modeltype %in% c("poisson", "quasipoisson")) {
        B <- length(resid(model)) - length(coef(model))
    }
    if (modeltype %in% c("nb", "negativebinomial")) {
        B <- length(resid(model)) - (length(coef(model)) + 1)
    }
    DISP <- A / B
    return(DISP)
}


# Generate data frame of residuals and fitted values
diag <- data.frame(fitted=fitted(poisson_glm), residuals=resid(poisson_glm))

# Visualize residuals vs fitted values
ggplot(diag) + 
    geom_point(aes(x=fitted, y=residuals))

# Calculate the dispersion of the model
dispersion(poisson_glm, modeltype="poisson")
## [1] 97.6

Chapter 2 - Extending GLMs

Adding Factors and Interactions:

  • Can use residuals to attempt to find additional factors for modeling (address over-dispersion)
    • pr_fac(poisson_glm, dragonflies$season, xlabel = "season", modeltype = "poisson")
  • May want to add factors and interactions among the predictor variables
    • Multiple predictor variables (factors) may influence the response variable
    • Interaction: the effect of one predictor may depend on the level of the other predictor variable
    • poisson_glm_factor <- glm(abundance ~ stream_flow * season, data = dragonflies, family = "poisson") # The star is an interaction effect - each alone, plus the interaction of the two
  • Can then use expand.grid to help with prediction for the interaction data
    • pred_df <- expand.grid(stream_flow = seq(from = 1, to = 5, length = 10), season = c("summer", "autumn"))
    • pred_df$predicted <- predict(poisson_glm_factor, pred_df, type = "response")
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance)) + geom_line(aes(x = stream_flow, y = predicted, col = season), data = pred_df)
    • diag <- data.frame(residuals = resid(poisson_glm_factor), fitted = fitted(poisson_glm_factor))
    • ggplot(diag) + geom_point(aes(x = fitted, y = residuals))
    • dispersion(poisson_glm_factor, modeltype = “poisson”)

Adding an Offset to the Model:

  • Unequal areas were sampled in the data, which needs to be accounted for in the count modeling
    • May want to look at metrics on a "per square meter" or similar basis
    • Dividing counts by area produces non-integer data, which violates the assumptions of a count model
  • Can instead add an offset to the count model
    • dragonflies$logarea <- log(dragonflies$area)
    • poisson_glm_offset <- glm(abundance ~ stream_flow * season + offset(logarea), data = dragonflies, family = "poisson")
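
The offset idea can be checked on simulated data. This is a minimal sketch (the data below are invented, not the course's dragonflies dataset): counts scale with the sampled area, so log(area) enters as an offset and the coefficients are estimated on a per-unit-area basis.

```r
# Simulated example (assumed data): counts scale with sampled area,
# so log(area) enters as an offset with a fixed coefficient of 1.
set.seed(42)
n <- 200
area <- runif(n, 1, 5)                  # unequal areas sampled
flow <- runif(n, 1, 5)
lambda <- area * exp(1.5 - 0.3 * flow)  # rate per unit area, times area
counts <- rpois(n, lambda)
sim <- data.frame(counts, flow, logarea = log(area))

offset_glm <- glm(counts ~ flow + offset(logarea),
                  data = sim, family = "poisson")
coef(offset_glm)  # estimates should land near the true values (1.5, -0.3)
```

Because the offset's coefficient is fixed at 1, the fitted intercept and slope describe the per-unit-area rate regardless of how much area each row covers.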

Negative Binomial Model and Model Selection:

  • Because the Poisson model uses a single parameter for both mean and variance, overdispersion is common
  • The negative binomial GLM adds a term that allows the variance to differ from (and typically exceed) the mean
    • Has an extra parameter, theta, which relaxes the assumption of equality between the mean and the variance
    • This improves upon the Poisson GLM by addressing the issue of overdispersion
    • Uses the same link function as Poisson GLMs
  • The MASS package has functions for fitting the negative binomial model
    • neg_binom_glm <- MASS::glm.nb(abundance ~ stream_flow * season + offset(logarea), data = dragonflies)
    • Use drop1 to test the influence of each term
    • drop1(neg_binom_glm, test = "Chisq")
    • neg_binom_glm_small <- MASS::glm.nb(abundance ~ stream_flow + season + offset(logarea), data = dragonflies)
    • Remove the interaction term, since it has a high p-value
    • drop1(neg_binom_glm_small, test = "Chisq")
  • Can now compare model performances and pick the best
    • dispersion(neg_binom_glm, modeltype = "nb")
    • dispersion(neg_binom_glm_small, modeltype = "nb")
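
A minimal sketch on simulated overdispersed counts (the data are invented, not the course's dragonflies) shows the pattern: the Poisson fit is clearly overdispersed, while the negative binomial fit pulls the dispersion statistic back toward 1.

```r
# Simulated overdispersed counts (assumed data); size plays the role of theta
set.seed(1)
x <- runif(300, 1, 5)
y <- rnbinom(300, mu = exp(2 - 0.4 * x), size = 1.5)
sim <- data.frame(x, y)

pois_fit <- glm(y ~ x, data = sim, family = "poisson")
nb_fit   <- MASS::glm.nb(y ~ x, data = sim)

# Pearson-residual dispersion; extra = 1 accounts for the estimated theta
disp <- function(model, extra = 0) {
  sum(resid(model, type = "pearson")^2) /
    (length(resid(model)) - length(coef(model)) - extra)
}
disp(pois_fit)           # well above 1: overdispersed
disp(nb_fit, extra = 1)  # near 1 once theta absorbs the extra variance
```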

Model Selection and Visualization:

  • Eventually need to select and defend a model even if there are some flaws or potential improvements
  • The AIC (Akaike Information Criterion) can be helpful for comparing nested models
    • AIC(neg_binom_glm, neg_binom_glm_small)
    • Lower values indicate better fit
    • A difference of about three or more suggests one model fits meaningfully better than the other
    • Generally, prefer the less complex model when the performances are similar
  • Can then generate predicted values from the selected model
    • pred_df <- expand.grid(stream_flow = seq(from = 1, to = 5, length = 10), season = c("summer", "autumn"), logarea = mean(dragonflies$logarea))
    • pred_df$predicted <- predict(neg_binom_glm_small, pred_df, type = "response")
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance)) + geom_line(aes(x = stream_flow, y = predicted, col = season), data = pred_df)
    • raw_fit <- predict(neg_binom_glm_small, pred_df, type = "link") # fitted values on the link scale
    • raw_se <- predict(neg_binom_glm_small, pred_df, type = "link", se.fit = TRUE)$se # SE on the link scale
    • pred_df$lower <- exp(raw_fit - 1.96 * raw_se)
    • pred_df$upper <- exp(raw_fit + 1.96 * raw_se)
    • ggplot(dragonflies) + geom_point(aes(x = stream_flow, y = abundance)) + geom_line(aes(x = stream_flow, y = predicted, col = season), data = pred_df) + geom_line(aes(x = stream_flow, y = upper, col = season), linetype = "dashed", data = pred_df) + geom_line(aes(x = stream_flow, y = lower, col = season), linetype = "dashed", data = pred_df)

Example code includes:

pr_fac <- function(model, plotfactor, xlabel = "", modeltype = "lm"){
    if(modeltype %in% c("linear", "lm", "poisson", "p", "quasipoisson", "qp")){
        skiprows <- unique(summary(model)$na.action)
    }
    if(modeltype %in% c("negativebinomial", "nb")){
        skiprows <- unique(summary(model)[21])
    }
    if(length(skiprows) > 0) {Factor <- plotfactor[-skiprows]}
    if(length(skiprows) == 0) {Factor <- plotfactor}
    plot1 <- data.frame(PR = resid(model, type = "pearson"),Factor)
    if(is.factor(Factor) == FALSE){
        PR.plot1 <- ggplot(plot1) + 
            geom_point(aes(y = PR, x = Factor)) + 
            geom_hline(yintercept = 0, linetype = 'dashed', col = 'red')+ 
            ylab("Residuals") + xlab(xlabel) + theme_bw(18)
    }
    if(is.factor(Factor) == TRUE){
        PR.plot1 <- ggplot(plot1) +
            geom_boxplot(aes(y = PR, x = Factor)) +
            geom_hline(yintercept = 0, linetype = 'dashed', col = 'red') + 
            ylab("Residuals") + xlab(xlabel) +
            theme_bw(18)
    }
    return(PR.plot1)
}


# Compare residuals across factor levels
pr_fac(poisson_glm, dragonflies$time, xlabel = "time", modeltype = "poisson")

# Add time as a factor, including an interaction
poisson_glm_factor <- glm(feeding_events ~ stream_flow * time, data = dragonflies, family = "poisson")

# Generate predicted values of feeding_events
pred_df <- expand.grid(stream_flow = seq(from = 1, to = 5, length = 10), time = c("day", "night"))
pred_df$predicted <- predict(poisson_glm_factor, pred_df, type = "response")

# Visualize predicted values of feeding events
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events)) +
    geom_line(aes(x = stream_flow, y = predicted, col = time), data = pred_df)

# Generate data frame of residuals and fitted values
diag <- data.frame(residuals=resid(poisson_glm_factor), fitted=fitted(poisson_glm_factor))

# Visualize residuals vs fitted values
ggplot(diag) +
    geom_point(aes(x=fitted, y=residuals))

# Calculate the dispersion of the model
dispersion(poisson_glm_factor, modeltype="poisson")
## [1] 86.8
# Create a column containing the natural log of area
dragonflies$logarea <- log(dragonflies$area)

# Apply Poisson GLM with interaction and offset
poisson_glm_offset <- glm(feeding_events ~ stream_flow * time + offset(logarea), 
                          data = dragonflies, family = "poisson"
                          )


# Apply Negative Binomial GLM
neg_binom_glm <- MASS::glm.nb(feeding_events ~ stream_flow*time + offset(logarea), data=dragonflies)

# Use drop1 to determine which term(s) can be dropped
drop1(neg_binom_glm, test="Chisq")
## Single term deletions
## 
## Model:
## feeding_events ~ stream_flow * time + offset(logarea)
##                  Df Deviance  AIC  LRT Pr(>Chi)
## <none>                   163 1829              
## stream_flow:time  1      164 1828 1.24     0.27
# Apply a new Negative Binomial GLM
neg_binom_glm_small <- MASS::glm.nb(feeding_events ~ stream_flow + time + offset(logarea), data=dragonflies)


# Calculate dispersion for each model
dispersion(neg_binom_glm, modeltype="nb")
## [1] 1.08
dispersion(neg_binom_glm_small, modeltype="nb")
## [1] 1.06
# Generate data frame of residuals and fitted values for neg_binom_glm
diag <- data.frame(fitted=fitted(neg_binom_glm), residuals=resid(neg_binom_glm))

# Visualize residuals vs fitted values for neg_binom_glm
ggplot(diag) + 
    geom_point(aes(x=fitted, y=residuals))

# Generate data frame of residuals and fitted values for neg_binom_glm
diag_small <- data.frame(residuals = resid(neg_binom_glm_small), fitted = fitted(neg_binom_glm_small))

# Visualize residuals vs fitted values for neg_binom_glm
ggplot(diag_small) +
    geom_point(aes(x = fitted, y = residuals))

# Compare AIC scores
AIC(neg_binom_glm, neg_binom_glm_small)
##                     df  AIC
## neg_binom_glm        5 1831
## neg_binom_glm_small  4 1830
# View the selected model
neg_binom_glm_small
## 
## Call:  MASS::glm.nb(formula = feeding_events ~ stream_flow + time + 
##     offset(logarea), data = dragonflies, init.theta = 1.837975779, 
##     link = log)
## 
## Coefficients:
## (Intercept)  stream_flow    timenight  
##       5.616       -0.871       -0.417  
## 
## Degrees of Freedom: 149 Total (i.e. Null);  147 Residual
## Null Deviance:       386 
## Residual Deviance: 163   AIC: 1830
# Create data frame
pred_df <- expand.grid(stream_flow=seq(1, 5, length=5), time=c("day", "night"), logarea=log(6))

# Generate predicted values
pred_df$predicted <- predict(neg_binom_glm_small, newdata=pred_df, type="response")

# Visualize predicted values
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events)) +
    geom_line(aes(x=stream_flow, y=predicted, color=time), data=pred_df)

# Extract fitted values
raw_fit <- predict(neg_binom_glm_small, pred_df, type = "link")

# Extract standard errors
raw_se <- predict(neg_binom_glm_small, pred_df, type = "link", se = TRUE)$se

# Generate predictions of upper and lower values
pred_df$upper <- exp(raw_fit + 1.96 * raw_se)
pred_df$lower <- exp(raw_fit - 1.96 * raw_se)

# Visualize the standard errors around the predicted values
ggplot(dragonflies) +
    geom_point(aes(x = stream_flow, y = feeding_events)) +
    geom_line(aes(x = stream_flow, y = predicted, col = time), data = pred_df) +
    geom_line(aes(x = stream_flow, y = lower, col = time), linetype="dashed", data = pred_df) +
    geom_line(aes(x = stream_flow, y = upper, col = time), linetype="dashed", data = pred_df)


Chapter 3 - Mixed Effects Model I

Mixed Effects Models:

  • Mixed Effects Models are useful for modeling grouped (nested) data
  • The example dataset is about orchids - 20 trees at each of 8 different sites
    • Data belonging to the same group are correlated
    • This violates assumptions about the independence of observations
  • Can address this using fixed effects
    • linear_glm <- glm(richness ~ tree_age + site, data = orchids, family = "gaussian")
    • Treating site as a fixed effect means that we are estimating parameters for each of the eight individual sites
    • Need to have sufficient data for each group
    • Adding parameters costs us degrees of freedom
  • Can instead treat site as part of a mixed effects model, since site is not a predictor of interest, merely a source of correlation
    • Random effect: the model aims to estimate the distribution of the effect rather than estimate the effect itself as a constant
    • Concerned with the wider population rather than the individuals sampled
  • The random intercept model estimates a distribution of the random effect of site on intercept
    • nlme::lme(a ~ b, random = ~1 | site, data = ...)
    • random = ~1 | site specifies a random intercept model where site will act as a random effect to influence the intercept of the linear model
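
A random intercept fit can be sketched on simulated grouped data (invented here, in place of the course's orchids data): each group shares a common shift, which the random intercept absorbs.

```r
library(nlme)

# Simulated grouped data (assumed): a shared per-group shift in the intercept
set.seed(7)
grp   <- factor(rep(letters[1:8], each = 20))
shift <- rnorm(8, sd = 3)[as.integer(grp)]  # one latent intercept per group
x <- runif(160, 0, 10)
y <- 2 + 0.5 * x + shift + rnorm(160)
sim <- data.frame(x, y, grp)

ri_model <- lme(y ~ x, random = ~1 | grp, data = sim)
fixef(ri_model)    # population-level intercept and slope
VarCorr(ri_model)  # estimated SD of the random intercept vs. the residual SD
```

The fixed-effects slope recovers the population relationship while spending only one extra variance parameter on the eight groups, rather than seven fixed-effect dummies.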

Model Selection and Interpretation:

  • May want to assess the relative qualities of the GLM and the random effects models
    • Linear models need to be fit with GLS (generalized least squares) to compare to mixed effects models
    • Mixed effects models need to be fit with REML (restricted maximum likelihood) to compare with linear models
  • The anova() function can be helpful for comparing models that share consistent underlying assumptions
    • anova(gls_model, random_int_model)
  • Can have a problem of "testing on the boundary" - relevant only when the p-value is marginal
  • Often want to understand the variance of the random effects variables
    • The values under the "Random effects:" section of the output can be used - square the reported standard deviations to get variances
    • VarCorr(random_int_model) # grab data using a function instead

Visualizing a Random Intercept Model:

  • Can begin by plotting the original data with the random effects data as a color
    • ggplot(orchids) + geom_jitter(aes(x = tree_age, y = richness, col = site))
  • Can add the population effects predictions (these are based only on the fixed components; overall relationships)
    • pred_df.fixed <- data.frame(tree_age = seq(from = 5, to = 20, length = 10))
    • pred_df.fixed$predicted <- predict(random_int_model, pred_df.fixed, level = 0) # level=0 means "population level predictions"
    • ggplot(orchids) + geom_jitter(aes(x = tree_age, y = richness, col = site)) + geom_line(aes(x = tree_age, y = predicted), size = 2, data = pred_df.fixed) # no color since site is not included
  • Can add the full effects predictions (fixed and random)
    • pred_df.random <- expand.grid(tree_age = seq(from = 5, to = 20, length = 10), site = unique(orchids$site))
    • pred_df.random$random <- predict(random_int_model, pred_df.random, level = 1) # level=1 is the full prediction (random plus fixed)
    • ggplot(orchids) + geom_jitter(aes(x = tree_age, y = richness, col = site)) + geom_line(aes(x = tree_age, y = predicted), size = 2, data = pred_df.fixed) + geom_line(aes(x = tree_age, y = random, col = site), data = pred_df.random)

Example code includes:

# Create scatterplot of humidity and abundance
ggplot(orchids, aes(x=humidity, y=abundance, color=site)) +
    geom_point()

# Apply GLM
linear_glm <- glm(abundance ~ humidity + site, data = orchids, family = "gaussian")

# Look at the output to see parameters for each site
coef(linear_glm)
## (Intercept)    humidity       siteb       sitec       sited       sitee 
##      -0.702       0.177       2.940      -6.582      -6.492      -1.579 
##       sitef       siteg       siteh 
##      -5.607      -6.000      -5.336
# Apply random intercept model
random_int_model <- nlme::lme(abundance ~ humidity, random = ~1|site, data=orchids)

# Look at model output
random_int_model
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -420
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.375       0.179 
## 
## Random effects:
##  Formula: ~1 | site
##         (Intercept) Residual
## StdDev:        3.52     3.07
## 
## Number of Observations: 160
## Number of Groups: 8
# Fit linear model using Generalized Least Squares
gls_model <- nlme::gls(abundance ~ humidity, data = orchids)

# Apply a random intercept model
random_int_model <- nlme::lme(abundance ~ humidity, random = ~1 | site, data = orchids, method = "REML")

# Apply likelihood ratio test to compare models
anova(gls_model, random_int_model)
##                  Model df AIC BIC logLik   Test L.Ratio p-value
## gls_model            1  3 944 953   -469                       
## random_int_model     2  4 849 861   -420 1 vs 2    97.5  <.0001
# Print the model that fits better
random_int_model
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -420
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.375       0.179 
## 
## Random effects:
##  Formula: ~1 | site
##         (Intercept) Residual
## StdDev:        3.52     3.07
## 
## Number of Observations: 160
## Number of Groups: 8
# Calculate estimate of variance for the random intercept
calculated_value <- 3.515514**2
calculated_value
## [1] 12.4
# Extract estimate of variance for the random intercept
extracted_value <- nlme::VarCorr(random_int_model)[1, 1]
extracted_value
## [1] "12.4"
# Create data frame for fixed component
pred_df.fixed <- data.frame(humidity = seq(from = 40, to = 75, length = 10))

# Generate population level predictions
pred_df.fixed$predicted <- predict(random_int_model, pred_df.fixed, level = 0)

# Visualize predicted values
ggplot(orchids) +
    geom_point(aes(x = humidity, y = abundance, col = site)) +
    geom_line(aes(x=humidity, y=predicted), data=pred_df.fixed, size=2)

# Create data frame for random component
pred_df.random <- expand.grid(humidity = seq(from = 40, to = 75, length = 10), site = unique(orchids$site))

# Generate within-site predictions
pred_df.random$random <- predict(random_int_model, newdata=pred_df.random, level=1)

# Visualize predicted values
ggplot(orchids) +
    geom_point(aes(x = humidity, y = abundance, col = site)) +
    geom_line(aes(x = humidity, y = predicted), size = 2, data = pred_df.fixed) +
    geom_line(aes(x=humidity, y=random, col = site), data = pred_df.random)


Chapter 4 - Mixed Effects Models II

Random Intercept and Slope Models:

  • May want to have the random effects drive both intercept and slope - no longer parallel lines, but instead different slopes AND intercepts
  • Can add interaction term in GLM, but may require too many parameters and too great a requirement for DF
    • linear_glm <- glm(richness ~ tree_age * site, data = orchids, family = "gaussian")
  • Can instead consider site as part of a mixed effects model
    • random_int_slope_model <- lme(richness ~ tree_age, random = ~1 + tree_age | site, data = orchids)
    • Random intercept model: ~1 | randomEffect
    • Random intercept and slope model: ~1 + predictor | randomEffect
  • Can then generate population level predictions
    • pred_df.fixed <- data.frame(tree_age = seq(from = 5, to = 20, length = 10))
    • pred_df.fixed$predicted <- predict(random_int_slope_model, pred_df.fixed, level = 0) # level=0 is for population level
  • Can also generate individual predictions
    • pred_df.random <- expand.grid(tree_age = seq(from = 5, to = 20, length = 10), site = unique(orchids$site))
    • pred_df.random$random <- predict(random_int_slope_model, pred_df.random, level = 1) # level=1 is the full prediction
    • ggplot(orchids) + geom_jitter(aes(x = tree_age, y = richness, col = site)) + geom_line(aes(x = tree_age, y = predicted), size = 2, data = pred_df.fixed) + geom_line(aes(x = tree_age, y = random, col = site), data = pred_df.random)

Model Selection and Interpretation:

  • May want to assess relative performance of including random effects slope term - need to fit with REML
    • random_int_model <- nlme::lme(richness ~ tree_age, random = ~1 | site, data = orchids, method = "REML")
    • random_int_slope_model <- nlme::lme(richness ~ tree_age, random = ~1 + tree_age | site, data = orchids, method = "REML")
    • anova(random_int_model, random_int_slope_model)
  • May want to correct the p-value for how ANOVA works with mixed-effects models
    • LR <- ((-284.8980) - (-284.6478)) * -2 # -284.8980 and -284.6478 are the logLik values from the two fits
    • ((1 - pchisq(LR, 1)) + (1 - pchisq(LR, 2))) * 0.5 # corrected p-value
  • May want to calculate the variances - can either calculate directly by squaring from the output
    • There is also now a correlation variable shown as Corr which shows whether slopes are linked to intercepts

Using Modeling as a Tool:

  • Modeling should be used as a tool for answering key research questions
    • What are the data?
    • What is the research question?
    • What are the variables of interest?
    • What do the raw data look like?
  • May want to use GLM in cases where the errors are not normally distributed
    • Flexible generalization of an ordinary linear model
    • Common approach for modeling count data, but broadly applicable
  • Need to clearly define the model formula and links for reporting
    • What is my model formula?
    • What link function is used?
    • How do I know this model is appropriate?
    • How did I choose this model over other models?
    • What does this model tell me about my research question?
  • Mixed effects models treat a grouping variable as a random categorical effect
    • Aim to estimate the distribution of a random categorical effect
    • Attempts to answer a question about the wider population rather than compare specific groups
  • Should answer key questions about the random effects modeling
    • What is my model formula?
    • How did I choose this model over other models?
    • Why did I choose to use this model type?
    • What does this model tell me about my research question?
  • General practices for reporting outcomes
    • No firm rules about what to include when reporting the results of your model
    • Do some research!

Wrap Up:

  • GLM and Mixed Effects Models
  • Evaluate and defend models
  • Visualize predictions
  • Extract pertinent information

Example code includes:

# Apply random intercept and slope model
random_int_slope_model <- nlme::lme(abundance ~ humidity, random = ~1 + humidity | site, data=orchids)

# Look at model output
random_int_slope_model
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -409
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.141       0.176 
## 
## Random effects:
##  Formula: ~1 + humidity | site
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 7.328  (Intr)
## humidity    0.139  -0.885
## Residual    2.721        
## 
## Number of Observations: 160
## Number of Groups: 8
# Create data frame for fixed component
pred_df.fixed <- data.frame(humidity = seq(from = 40, to = 75, length = 10))

# Generate population level predictions
pred_df.fixed$predicted <- predict(random_int_slope_model, pred_df.fixed, level = 0)

# Create data frame for random component
pred_df.random <- expand.grid(humidity = seq(from = 40, to = 75, length = 10), site = unique(orchids$site))

# Generate within-site predictions
pred_df.random$random <- predict(random_int_slope_model, pred_df.random, level = 1)


# Visualize population level predictions and within-site predictions of abundance
ggplot(orchids) +
    geom_point(aes(x = humidity, y = abundance, col = site)) +
    geom_line(aes(x=humidity, y=predicted), size=2, data=pred_df.fixed)

# Visualize population level predictions and within-site predictions of abundance
ggplot(orchids) +
    geom_point(aes(x = humidity, y = abundance, col = site)) +
    geom_line(aes(x = humidity, y = predicted), size = 2, data = pred_df.fixed) +
    geom_line(aes(x=humidity, y=random, col = site), data = pred_df.random)

# Apply a maximum likelihood ratio test
anova(random_int_model, random_int_slope_model)
##                        Model df AIC BIC logLik   Test L.Ratio p-value
## random_int_model           1  4 849 861   -420                       
## random_int_slope_model     2  6 829 848   -409 1 vs 2    23.1  <.0001
# Calculate the corrected p-value
LR <- ((-420.2667) - (-408.7254)) * -2
((1 - pchisq(LR, 1)) + (1 - pchisq(LR, 2))) * 0.5
## [1] 5.64e-06
# Print the model that has fewer parameters
fewer_parameters <- random_int_model
fewer_parameters
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -420
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.375       0.179 
## 
## Random effects:
##  Formula: ~1 | site
##         (Intercept) Residual
## StdDev:        3.52     3.07
## 
## Number of Observations: 160
## Number of Groups: 8
# Print the model that has the better AIC value
better_aic_value <- random_int_slope_model
better_aic_value
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -409
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.141       0.176 
## 
## Random effects:
##  Formula: ~1 + humidity | site
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 7.328  (Intr)
## humidity    0.139  -0.885
## Residual    2.721        
## 
## Number of Observations: 160
## Number of Groups: 8
# View the model output
random_int_slope_model
## Linear mixed-effects model fit by REML
##   Data: orchids 
##   Log-restricted-likelihood: -409
##   Fixed: abundance ~ humidity 
## (Intercept)    humidity 
##      -4.141       0.176 
## 
## Random effects:
##  Formula: ~1 + humidity | site
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 7.328  (Intr)
## humidity    0.139  -0.885
## Residual    2.721        
## 
## Number of Observations: 160
## Number of Groups: 8
# Calculate the estimated variance of random intercept
variance_int <- 7.3277203**2

# Calculate the estimated variance of random slope
variance_slope <- 0.1387053**2

# Print the higher estimate
variance_int
## [1] 53.7
myData <- data.frame(X=1:160, 
                     y=c(7, 4, 4, 4, 4, 3, 4, 3, 3, 3, 7, 3, 3, 3, 3, 3, 4, 4, 6, 3, 2, 3, 3, 7, 8, 2, 3, 4, 9, 4, 3, 3, 4, 6, 4, 3, 3, 4, 7, 5, 2, 1, 2, 2, 2, 3, 0, 3, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 3, 1, 3, 1, 2, 2, 0, 2, 0, 2, 3, 1, 2, 1, 2, 1, 1, 2, 2, 10, 3, 0, 2, 2, 7, 4, 0, 2, 2, 5, 2, 1, 2, 2, 5, 4, 1, 3, 2, 2, 1, 4, 2, 2, 1, 2, 2, 2, 3, 3, 2, 3, 2, 4, 2, 2, 4, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 1, 3, 2, 2, 2, 3, 2, 0, 1, 3, 3, 2, 1, 2, 3, 3, 3, 0, 2, 2, 2, 2, 1), 
                     x=c(59.5, 70.4, 73.4, 53.8, 66.8, 57.9, 78.7, 81, 60.1, 73.3, 69.3, 60.1, 63.4, 48.1, 65.4, 58.2, 68.9, 75.1, 58.3, 69.4, 48.1, 51.8, 53.6, 59.4, 73.3, 43.4, 45.2, 58.5, 52.3, 81.5, 39.9, 48.1, 50.8, 49.8, 82.5, 44.3, 56.1, 57.6, 62.9, 78, 69.8, 45.1, 62, 64.8, 58.3, 68.6, 38.6, 64.5, 57, 65.9, 71.3, 33.8, 72.7, 71.1, 53.7, 73.4, 46.2, 63.4, 62.9, 53, 42.3, 60.3, 55.4, 46.3, 68.1, 45.5, 53.7, 47.1, 43, 58.7, 48, 68.9, 54.7, 58.2, 66.4, 42.5, 59.1, 50.6, 50.2, 70.8, 48.8, 65, 52.7, 43.7, 59.5, 42.1, 74.7, 61.6, 45.6, 61.2, 59.3, 72.5, 42.3, 50.1, 68.9, 44.1, 68.6, 49.6, 44.4, 61.9, 65.8, 63.5, 37.8, 68.9, 52.3, 74.1, 66, 38, 68.2, 55.7, 62.1, 57.8, 29.5, 76.6, 51.3, 70.3, 59.6, 37, 65.2, 54, 51.2, 42.1, 46.2, 60.6, 56.3, 51.1, 42.5, 38.5, 51.6, 50.4, 59.1, 33.4, 49.2, 65.7, 64.7, 57, 38.7, 49.7, 60.2, 54.7, 43.3, 63.7, 70, 58.3, 39.5, 48.5, 68.6, 68.5, 57.6, 47, 40.3, 70.8, 77.4, 59.9, 48.8, 43.5, 63, 65.1, 54.9, 40), 
                     group=as.factor(c('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h'))
                     )

model1 <- nlme::gls(y ~ x, data = myData)
model2 <- glm(y ~ x, data = myData)
model3 <- glm(y ~ x * group, data = myData, family = "gaussian")
model4 <- MASS::glm.nb(y ~ x, data = myData)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
model5 <- nlme::lme(y ~ x, random = ~1|group, data = myData)
model6 <- nlme::lme(y ~ x, random = ~1 + x|group, data = myData)


# Print the coefficients of the model
coef(model3)
## (Intercept)           x      groupb      groupc      groupd      groupe 
##      4.0308     -0.0020     -2.9804     -4.5017     -5.7360     -9.2038 
##      groupf      groupg      grouph    x:groupb    x:groupc    x:groupd 
##     -3.3564     -5.8536     -5.4391      0.0600      0.0408      0.0627 
##    x:groupe    x:groupf    x:groupg    x:grouph 
##      0.1475      0.0310      0.0679      0.0615
# Print the model output
model4
## 
## Call:  MASS::glm.nb(formula = y ~ x, data = myData, init.theta = 31010.139, 
##     link = log)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6224       0.0265  
## 
## Degrees of Freedom: 159 Total (i.e. Null);  158 Residual
## Null Deviance:       168 
## Residual Deviance: 131   AIC: 555
# Print the model formula
model5$call
## lme.formula(fixed = y ~ x, data = myData, random = ~1 | group)
model1 <- nlme::gls(y ~ x, data = myData)
model2 <- glm(y ~ x, data = myData)
model3 <- glm(y ~ x * group, data = myData, family = "gaussian")
model4 <- nlme::lme(y ~ x, random = ~1|group, data = myData, method = "REML")
model5 <- nlme::lme(y ~ x, random = ~1|group, data = myData, method = "ML")


# Apply a likelihood ratio test
anova(model4, model1)
##        Model df AIC BIC logLik   Test L.Ratio p-value
## model4     1  4 552 564   -272                       
## model1     2  3 596 605   -295 1 vs 2      46  <.0001
# Correct the p-value
LR <- ((-295.0109) - (-272.0328)) * -2
(1 - pchisq(LR, 1)) * 0.5
## [1] 6.05e-12
# Generate data frame of residuals and fitted values
diag <- data.frame(residuals=model2$residuals, fitted=model2$fitted)

# Visualize residuals vs fitted values
ggplot(diag, aes(x=fitted, y=residuals)) +
    geom_point()

model1 <- glm(y ~ x * group, data = myData, family = "gaussian")
model2 <- MASS::glm.nb(y ~ x * group, data = myData)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
model3 <- nlme::lme(y ~ x, random = ~1 + x|group, data = myData)


# Create a data frame where x is 35 and 40 at each group
pred_df <- expand.grid(x = c(35, 40), group = unique(myData$group))

# Generate predictions using model1
predict(model1, pred_df)
##       1       2       3       4       5       6       7       8       9 
##  3.9610  3.9510  3.0816  3.3718  0.8874  1.0814  0.4178  0.7211 -0.0807 
##      10      11      12      13      14      15      16 
##  0.6468  1.6911  1.8363  0.4853  0.8150  0.6745  0.9720
# Generate predictions using model2
predict(model2, pred_df, type = "response")
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 3.961 3.951 3.266 3.477 0.997 1.122 0.741 0.891 0.922 1.180 1.720 1.837 
##    13    14    15    16 
## 0.692 0.868 0.933 1.091
# Generate predictions using model3
predict(model3, pred_df)
##     a     a     b     b     c     c     d     d     e     e     f     f 
## 2.746 2.922 2.920 3.218 0.803 1.027 0.497 0.796 0.618 1.172 1.505 1.694 
##     g     g     h     h 
## 0.574 0.892 0.733 1.027 
## attr(,"label")
## [1] "Predicted values"

Topic Modeling in R

Chapter 1 - Introduction to the Workflow

Background:

  • Topics give a quick description of what an article is about
    • A topic is a label for a collection of words that often occur together. E.g., the topic weather includes words such as: rain, storm, snow, winds, ice
  • Topic modeling is the process of finding a collection of topics fitted to a set of documents
    • Increasingly popular; it helps the audience know whether an article is likely of interest to them
  • Course will focus on one specific implementation of topic modeling algorithms, called Latent Dirichlet Allocation (LDA)
    • LDA takes a document-term matrix as its input - frequencies of words, but NOT word orders
    • A collection of documents is referred to as a corpus
  • Can use LDA as an unsupervised clustering algorithm
    • lda_mod <- LDA(x=d, k=2, method="Gibbs", control=list(alpha=1, delta=0.1, seed=10005, keep=1))
    • The result includes two tables, for terms and topics.
  • Some general best practices apply to topic modeling
    • Matrices are not a good way to present the results. We need to use charts
    • There are choices about which words to keep and which ones to exclude from a document-term matrix
    • Documents can be constructed in multiple ways: they can be based on chapters in a novel, on paragraphs, or even on a sequence of several words
    • The LDA algorithm relies on control parameters which can impact the output
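
A minimal runnable sketch, assuming the topicmodels package is available (the toy counts and term names below are invented for illustration): fit a two-topic LDA by Gibbs sampling to a small document-term matrix and read off the top terms per topic.

```r
library(topicmodels)

# Toy document-term counts (invented): two "weather" docs, two "sports" docs
counts <- matrix(c(5, 4, 0, 0,
                   4, 5, 1, 0,
                   0, 1, 5, 4,
                   0, 0, 4, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("rain", "storm", "goal", "referee")))

# LDA() accepts a simple_triplet_matrix of term frequencies
dtm <- slam::as.simple_triplet_matrix(counts)
lda_mod <- LDA(x = dtm, k = 2, method = "Gibbs",
               control = list(alpha = 1, delta = 0.1, seed = 10005))
terms(lda_mod, 2)  # top two terms in each topic
```

Note that only the frequencies in the matrix reach the model - word order plays no part.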

Counting Words:

  • The task of splitting text into words is also called ‘tokenization’ - sequence of characters or sequence of words
  • Package tidytext has function unnest_tokens() that does the splitting
    • unnest_tokens(data, input=text, output=word, format="text", token="words", drop=TRUE, to_lower=TRUE)
    • It returns a tidy table, with one word per row
  • Can use the count() function from dplyr to count the words from unnest_tokens()
    • book %>% unnest_tokens(input=text, output=word) %>% count(chapter, word)
    • To get the top-n words: group words by chapter, sort/arrange by count in descending order, keep rows whose number is less than n
    • book %>% unnest_tokens(input=text, output=word) %>% count(chapter, word) %>% group_by(chapter) %>% arrange(desc(n)) %>% filter(row_number() < 3) %>% ungroup()
  • Can cast the word counts into a document-term matrix
    • Casting a table means transforming it into a different format
    • A document-term matrix (dtm) contains counts of words
    • Each row corresponds to a document, each column - to a word
    • In our case, each chapter is its own document
    • Package tidytext has function cast_dtm to do this transformation
    • Just add cast_dtm after count
    • cast_dtm(data, document=chapter, term=word, value=n)
    • dtm <- book %>%
    • unnest_tokens(input=text, output=word) %>%
    • count(chapter, word) %>%
    • cast_dtm(document=chapter, term=word, value=n) # created as a sparse matrix
    • as.matrix(dtm) # converts to a regular matrix
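The steps above can be combined into one runnable sketch. The `book` data frame here is hypothetical example data invented for illustration; the pipeline itself (unnest_tokens, count, cast_dtm) follows the notes.

```r
# Sketch: from raw text to a document-term matrix, assuming a `book` data
# frame with `chapter` and `text` columns (hypothetical example data).
library(dplyr)
library(tidytext)

book <- tibble::tibble(
  chapter = c(1, 1, 2),
  text = c("the bank opened", "the bank agreed to pay",
           "a new restaurant opened")
)

dtm <- book %>%
  unnest_tokens(output = word, input = text) %>%  # one word per row
  count(chapter, word) %>%                        # word counts per chapter
  cast_dtm(document = chapter, term = word, value = n)

dim(as.matrix(dtm))  # chapters x distinct words
```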

Displaying Frequencies with ggplot:

  • Can plot either word counts or probabilities of a document belonging to a topic
  • When we fit a topic model, the result is an LDA model object - It contains two matrices: beta and gamma
    • beta contains probabilities of words in topics
    • gamma contains probabilities of topics in documents
    • lda_mod <- LDA(x=d2, k=2, method="Gibbs", control=list(alpha=1, delta=0.1, seed=10005))
    • str(lda_mod) # will have @beta and @gamma slots
  • Can create stacked column charts for topic membership using gamma
    • tidy(lda_mod, matrix="gamma") %>%
    • ggplot(aes(x=document, y=gamma)) +
    • geom_col(aes(fill=as.factor(topic)))
  • Can create dodge column charts for word frequencies using beta
    • tidy(lda_mod, matrix="beta") %>%
    • ggplot(aes(x=term, y=beta)) +
    • geom_col(aes(fill=as.factor(topic)), position=position_dodge())
    • tidy(lda_mod, matrix="beta") %>%
    • mutate(topic = as.factor(topic)) %>%
    • ggplot(aes(x=term, y=beta)) +
    • geom_col(aes(fill=topic), position=position_dodge()) +
    • theme(axis.text.x = element_text(angle=90)) # labels rotated 90 degrees
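Putting the two chart recipes together as one hedged sketch: it assumes `lda_mod` is a fitted LDA object (as in the earlier fitting code) and uses tidytext's `tidy()` accessors for the gamma and beta matrices.

```r
# Sketch: charting an LDA fit, assuming `lda_mod` is a fitted LDA object.
library(tidytext)
library(dplyr)
library(ggplot2)

# Stacked columns: topic mixture per document (gamma).
tidy(lda_mod, matrix = "gamma") %>%
  ggplot(aes(x = document, y = gamma, fill = as.factor(topic))) +
  geom_col()

# Dodged columns: word probabilities per topic (beta).
tidy(lda_mod, matrix = "beta") %>%
  mutate(topic = as.factor(topic)) %>%
  ggplot(aes(x = term, y = beta, fill = topic)) +
  geom_col(position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90))  # rotate term labels
```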

Example code includes:

word_topics <- matrix(data=c(0.0034, 0.0279, 0.0374, 0.0025, 0.0034, 0.0787, 0.0034, 0.0279, 0.0034, 0.0533, 0.1054, 0.0025, 0.0034, 0.0787, 0.0034, 0.1294, 0.0034, 0.1041, 0.0034, 0.0279, 0.0034, 0.0279, 0.0034, 0.0279, 0.0374, 0.0025, 0.0034, 0.0533, 0.0034, 0.0787, 0.0034, 0.1294, 0.0034, 0.0279, 0.0714, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0714, 0.0025, 0.1054, 0.0025, 0.1395, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0374, 0.0025, 0.0034, 0.0279, 0.0374, 0.0025, 0.0034, 0.0279, 0.0034, 0.0279, 0.0374, 0.0025), 
                      nrow=2, ncol=34, byrow=FALSE, 
                      dimnames=list(c(1, 2), 
                                    c('agreed', 'bad', 'bank', 'due', 'fines', 'loans', 'pay', 'the', 'to', 'are', 'face', 'if', 'late', 'off', 'will', 'you', 'your', 'a', 'downtown', 'in', 'new', 'opened', 'restaurant', 'is', 'just', 'on', 'street', 'that', 'there', 'warwick', 'for', 'how', 'need', 'want'
                                      )
                                    )
                      )


# Display the column names
colnames(word_topics)
##  [1] "agreed"     "bad"        "bank"       "due"        "fines"     
##  [6] "loans"      "pay"        "the"        "to"         "are"       
## [11] "face"       "if"         "late"       "off"        "will"      
## [16] "you"        "your"       "a"          "downtown"   "in"        
## [21] "new"        "opened"     "restaurant" "is"         "just"      
## [26] "on"         "street"     "that"       "there"      "warwick"   
## [31] "for"        "how"        "need"       "want"
# Display the probability
word_topics[1, "street"]
## [1] 0.0374
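To plot the `word_topics` matrix defined above with ggplot2, it first needs to be reshaped into long format (one row per topic/term pair). This is a sketch using base R's `as.table()`/`as.data.frame()` conversion; the column names `topic`, `term`, and `prob` are chosen here for illustration.

```r
# Sketch: reshape the word_topics matrix (topics x terms) into long format,
# then draw dodged columns of per-topic word probabilities.
library(ggplot2)

wt_long <- as.data.frame(as.table(word_topics), responseName = "prob")
names(wt_long)[1:2] <- c("topic", "term")

ggplot(wt_long, aes(x = term, y = prob, fill = topic)) +
  geom_col(position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90))
```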
ch1 <- paste0('Two thousand five hundred and fifty-eight years ago a little fleet of galleys toiled painfully against the current up the long strait of the Hellespont, rowed across the broad Propontis, and came to anchor in the smooth waters of the first inlet which cuts into the European shore of the Bosphorus. There a long crescent-shaped creek, which after-ages were to know as the Golden Horn, strikes inland for seven miles, forming a quiet backwater from the rapid stream which runs outside. On the headland, enclosed between this inlet and the open sea, a few hundred colonists disembarked, and hastily secured themselves from the wild tribes of the inland, by running some rough sort of a stockade across the ground from beach to beach. Thus was founded the city of Byzantium. The settlers were Greeks of the Dorian race, natives of the thriving seaport-state of Megara, one of the most enterprising of all the cities of Hellas in the time of colonial and commercial expansion which was then at its height. Wherever a Greek prow had cut its way into unknown waters, there Megarian seamen were soon found following in its wake. ', 
              'One band of these venturesome traders pushed far to the West to plant colonies in Sicily, but the larger share of the attention of Megara was turned towards the sunrising, towards the mist-enshrouded entrance of the Black Sea and the fabulous lands that lay beyond. There, as legends told, was to be found the realm of the Golden Fleece, the Eldorado of the ancient world, where kings of untold wealth reigned over the tribes of Colchis: there dwelt, by the banks of the river Thermodon, the Amazons, the warlike women who had once vexed far-off Greece by their inroads: there, too, was to be found, if one could but struggle far enough up its northern shore, the land of the Hyperboreans, the blessed folk who dwell behind the North Wind and know nothing of storm and winter. To seek these fabled wonders the Greeks sailed ever North and East till they had come to the extreme limits of the sea. The riches of the Golden Fleece they did not find, nor the country of the Hyperboreans, nor the tribes of the Amazons; but they did discover many lands well worth the knowing, and grew rich on the profits which they drew from the metals of Colchis and the forests of Paphlagonia, from the rich corn lands by the banks of the Dnieper and Bug, and the fisheries of the Bosphorus and the Maeotic Lake. Presently the whole coastland of the sea, which the Greeks, on their first coming, called Axeinos--\"the Inhospitable\"--became fringed with trading settlements, and its name was changed to Euxeinos--\"the Hospitable\"--in recognition of its friendly ports. It was in a similar spirit that, two thousand years later, the seamen who led the next great impulse of exploration that rose in Europe, turned the name of the \"Cape of Storms\" into that of the \"Cape of Good Hope.\" The Megarians, almost more than any other Greeks, devoted their attention to the Euxine, and the foundation of Byzantium was but one of their many achievements. ', 
              'Already, seventeen years before Byzantium came into being, another band of Megarian colonists had established themselves at Chalcedon, on the opposite Asiatic shore of the Bosphorus. The settlers who were destined to found the greater city applied to the oracle of Delphi to give them advice as to the site of their new home, and Apollo, we are told, bade them \"build their town over against the city of the blind.\" They therefore pitched upon the headland by the Golden Horn, reasoning that the Chalcedonians were truly blind to have neglected the more eligible site on the Thracian shore, in order to found a colony on the far less inviting Bithynian side of the strait. Early Coin Of Byzantium. Late Coin Of Byzantium Showing Crescent And Star. From the first its situation marked out Byzantium as destined for a great future. Alike from the military and from the commercial point of view no city could have been better placed. Looking out from the easternmost headland of Thrace, with all Europe behind it and all Asia before, it was equally well suited to be the frontier fortress to defend the border of the one, or the basis of operations for an invasion from the other. ', 
              'As fortresses went in those early days it was almost impregnable--two sides protected by the water, the third by a strong wall not commanded by any neighbouring heights. In all its early history Byzantium never fell by storm: famine or treachery accounted for the few occasions on which it fell into the hands of an enemy. In its commercial aspect the place was even more favourably situated. It completely commanded the whole Black Sea trade: every vessel that went forth from Greece or Ionia to traffic with Scythia or Colchis, the lands by the Danube mouth or the shores of the Maeotic Lake, had to pass close under its walls, so that the prosperity of a hundred Hellenic towns on the Euxine was always at the mercy of the masters of Byzantium. The Greek loved short stages and frequent stoppages, and as a half-way house alone Byzantium would have been prosperous: but it had also a flourishing local trade of its own with the tribes of the neighbouring Thracian inland, and drew much profit from its fisheries: so much so that the city badge--its coat of arms as we should call it--comprised a tunny-fish as well as the famous ox whose form alluded to the legend of the naming of the Bosphorus. As an independent state Byzantium had a long and eventful history. ', 
              'For thirty years it was in the hands of the kings of Persia, but with that short exception it maintained its freedom during the first three hundred years that followed its foundation. Many stirring scenes took place beneath its walls: it was close to them that the great Darius threw across the Bosphorus his bridge of boats, which served as a model for the more famous structure on which his son Xerxes crossed the Hellespont. Fifteen years later, when Byzantium in common with all its neighbours made an ineffectual attempt to throw off the Persian yoke, in the rising called the \"Ionic Revolt,\" it was held for a time by the arch-rebel Histiaeus, who--as much to enrich himself as to pay his seamen--invented strait dues. He forced every ship passing up or down the Bosphorus to pay a heavy toll, and won no small unpopularity thereby for the cause of freedom which he professed to champion. Ere long Byzantium fell back again into the hands of Persia, but she was finally freed from the Oriental yoke seventeen years later, when the victorious Greeks, fresh from the triumph of Salamis and Mycale, sailed up to her walls and after a long leaguer starved out the obstinate garrison [B.C. 479]. ', 
              'The fleet wintered there, and it was at Byzantium that the first foundations of the naval empire of Athens were laid, when all the Greek states of Asia placed their ships at the disposal of the Athenian admirals Cimon and Aristeides. During the fifth century Byzantium twice declared war on Athens, now the mistress of the seas, and on each occasion fell into the hands of the enemy--once by voluntary surrender in 439 B.C., once by treachery from within, in 408 B.C. But the Athenians, except in one or two disgraceful cases, did not deal hardly with their conquered enemies, and the Byzantines escaped anything harder than the payment of a heavy war indemnity. In a few years their commercial gains repaired all the losses of war, and the state was itself again. We know comparatively little about the internal history of these early centuries of the life of Byzantium. Some odd fragments of information survive here and there: we know, for example, that they used iron instead of copper for small money, a peculiarity shared by no other ancient state save Sparta. Their alphabet rejoiced in an abnormally shaped {~GREEK CAPITAL LETTER BETA~}, which puzzled all other Greeks, for it resembled a {~GREEK CAPITAL LETTER PI~} with an extra limb. The chief gods of the city were those that we might have expected--Poseidon the ruler of the sea, whose blessing gave Byzantium its chief wealth; and Demeter, the goddess who presided over the Thracian and Scythian corn lands which formed its second source of prosperity. ', 
              'The Byzantines were, if ancient chroniclers tell us the truth, a luxurious as well as a busy race: they spent too much time in their numerous inns, where the excellent wines of Maronea and other neighbouring places offered great temptations. They were gluttons too as well as tipplers: on one occasion, we are assured, the whole civic militia struck work in the height of a siege, till their commander consented to allow restaurants to be erected at convenient distances round the ramparts. One comic writer informs us that the Byzantines were eating young tunny-fish--their favourite dish--so constantly, that their whole bodies had become well-nigh gelatinous, and it was thought they might melt if exposed to too great heat! Probably these tales are the scandals of neighbours who envied Byzantine prosperity, for it is at any rate certain that the city showed all through its history great energy and love of independence, and never shrank from war as we should have expected a nation of epicures to do. It was not till the rise of Philip of Macedon and his greater son Alexander that Byzantium fell for the fifth time into the hands of an enemy. The elder king was repulsed from the citys walls after a long siege, culminating in an attempt at an escalade by night, which was frustrated owing to the sudden appearance of a light in heaven, which revealed the advancing enemy and was taken by the Byzantines as a token of special divine aid [B.C. 339]. In commemoration of it they assumed as one of their civic badges the blazing crescent and star, which has descended to our own days and is still used as an emblem by the present owners of the city--the Ottoman Sultans. ', 
              'But after repulsing Philip the Byzantines had to submit some years later to Alexander. They formed under him part of the enormous Macedonian empire, and passed on his decease through the hands of his successors--Demetrius Poliorcetes, and Lysimachus. After the death of the latter in battle, however, they recovered a precarious freedom, and were again an independent community for a hundred years, till the power of Rome invaded the regions of Thrace and the Hellespont. Byzantium was one of the cities which took the wise course of making an early alliance with the Romans, and obtained good and easy terms in consequence. During the wars of Rome with Macedon and Antiochus the Great it proved such a faithful assistant that the Senate gave it the status of a _civitas libera et foederata_, \"a free and confederate city,\" and it was not taken under direct Roman government, but allowed complete liberty in everything save the control of its foreign relations and the payment of a tribute to Rome. It was not till the Roman Republic had long passed away, that the Emperor Vespasian stripped it of these privileges, and threw it into the province of Thrace, to exist for the future as an ordinary provincial town [A.D. 73]. ', 
              'Though deprived of a liberty which had for long years been almost nominal, Byzantium could not be deprived of its unrivalled position for commerce. It continued to flourish under the _Pax Romana_, the long-continued peace which all the inner countries of the empire enjoyed during the first two centuries of the imperial _regime_, and is mentioned again and again as one of the most important cities of the middle regions of the Roman world. But an evil time for Byzantium, as for all the other parts of the civilized world, began when the golden age of the Antonines ceased, and the epoch of the military emperors followed. In 192 A.D., Commodus, the unworthy son of the great and good Marcus Aurelius, was murdered, and ere long three military usurpers were wrangling for his blood-stained diadem. Most unhappily for itself Byzantium lay on the line of division between the eastern provinces, where Pescennius Niger had been proclaimed, and the Illyrian provinces, where Severus had assumed the imperial style. The city was seized by the army of Syria, and strengthened in haste. Presently Severus appeared from the west, after he had made himself master of Rome and Italy, and fell upon the forces of his rival Pescennius. Victory followed the arms of the Illyrian legions, the east was subdued, and the Syrian emperor put to death. But when all his other adherents had yielded, the garrison of Byzantium refused to submit. ', 
              'For more than two years they maintained the impregnable city against the lieutenants of Severus, and it was not till A.D. 196 that they were forced to yield. The emperor appeared in person to punish the long-protracted resistance of the town; not only the garrison, but the civil magistrates of Byzantium were slain before his eyes. The massive walls \"so firmly built with great square stones clamped together with bolts of iron, that the whole seemed but one block,\" were laboriously cast down. The property of the citizens was confiscated, and the town itself deprived of all municipal privileges and handed over to be governed like a dependent village by its neighbours of Perinthus. Caracalla, the son of Severus, gave back to the Byzantines the right to govern themselves, but the town had received a hard blow, and would have required a long spell of peace to recover its prosperity. Peace however it was not destined to see. All through the middle years of the third century it was vexed by the incursions of the Goths, who harried mercilessly the countries on the Black Sea whose commerce sustained its trade. Under Gallienus in A.D. 263 it was again seized by an usurping emperor, and shared the fate of his adherents. The soldiers of Gallienus sacked Byzantium from cellar to garret, and made such a slaughter of its inhabitants that it is said that the old Megarian race who had so long possessed it were absolutely exterminated. ', 
              'But the irresistible attraction of the site was too great to allow its ruins to remain desolate. Within ten years after its sack by the army of Gallienus, we find Byzantium again a populous town, and its inhabitants are specially praised by the historian Trebellius Pollio for the courage with which they repelled a Gothic raid in the reign of Claudius II. The strong Illyrian emperors, who staved off from the Roman Empire the ruin which appeared about to overwhelm it in the third quarter of the third century, gave Byzantium time and peace to recover its ancient prosperity. It profited especially from the constant neighbourhood of the imperial court, after Diocletian fixed his residence at Nicomedia, only sixty miles away, on the Bithynian side of the Propontis. But the military importance of Byzantium was always interfering with its commercial greatness. After the abdication of Diocletian the empire was for twenty years vexed by constant partitions of territory between the colleagues whom he left behind him. Byzantium after a while found itself the border fortress of Licinius, the emperor who ruled in the Balkan Peninsula, while Maximinus Daza was governing the Asiatic provinces. ', 
              'While Licinius was absent in Italy, Maximinus treacherously attacked his rivals dominions without declaration of war, and took Byzantium by surprise. But the Illyrian emperor returned in haste, defeated his grasping neighbour not far from the walls of the city, and recovered his great frontier fortress after it had been only a few months out of his hands [A.D. 314]. The town must have suffered severely by changing masters twice in the same year; it does not, however, seem to have been sacked or burnt, as was so often the case with a captured city in those dismal days. But Licinius when he had recovered the place set to work to render it impregnable. Though it was not his capital he made it the chief fortress of his realm, which, since the defeat of Maximinus, embraced the whole eastern half of the Roman world. It was accordingly at Byzantium that Licinius made his last desperate stand, when in A.D. 323 he found himself engaged in an unsuccessful war with his brother-in-law Constantine, the Emperor of the West. For many months the war stood still beneath the walls of the city; but Constantine persevered in the siege, raising great mounds which overlooked the walls, and sweeping away the defenders by a constant stream of missiles, launched from dozens of military engines which he had erected on these artificial heights. At last the city surrendered, and the cause of Licinius was lost. Constantine, the last of his rivals subdued, became the sole emperor of the Roman world, and stood a victor on the ramparts which were ever afterwards to bear his name.')

ch2 <- paste0('When the fall of Byzantium had wrecked the fortunes of Licinius, the Roman world was again united beneath the sceptre of a single master. For thirty-seven years, ever since Diocletian parcelled out the provinces with his colleagues, unity had been unknown, and emperors, whose number had sometimes risen to six and sometimes sunk to two, had administered their realms on different principles and with varying success. Constantine, whose victory over his rivals had been secured by his talents as an administrator and a diplomatist no less than by his military skill, was one of those men whose hard practical ability has stamped upon the history of the world a much deeper impress than has been left by many conquerors and legislators of infinitely greater genius. He was a man of that self-contained, self-reliant, unsympathetic type of mind which we recognize in his great predecessor Augustus, or in Frederic the Great of Prussia. Constantine the Great Though the strain of old Roman blood in his veins must have been but small, Constantine was in many ways a typical Roman; the hard, cold, steady, unwearying energy, which in earlier centuries had won the empire of the world, was once more incarnate in him. ', 
              'But if Roman in character, he was anything but Roman in his sympathies. Born by the Danube, reared in the courts and camps of Asia and Gaul, he was absolutely free from any of that superstitious reverence for the ancient glories of the city on the Tiber which had inspired so many of his predecessors. Italy was to him but a secondary province amongst his wide realms. When he distributed his dominions among his heirs, it was Gaul that he gave as the noblest share to his eldest and best-loved son: Italy was to him a younger childs portion. There had been emperors before him who had neglected Rome: the barbarian Maximinus I. had dwelt by the Rhine and the Danube; the politic Diocletian had chosen Nicomedia as his favourite residence. But no one had yet dreamed of raising up a rival to the mistress of the world, and of turning Rome into a provincial town. If preceding emperors had dwelt far afield, it was to meet the exigencies of war on the frontiers or the government of distant provinces. ', 
              'It was reserved for Constantine to erect over against Rome a rival metropolis for the civilized world, an imperial city which was to be neither a mere camp nor a mere court, but the administrative and commercial centre of the Roman world. For more than a hundred years Rome had been a most inconvenient residence for the emperors. The main problem which had been before them was the repelling of incessant barbarian inroads on the Balkan Peninsula; the troubles on the Rhine and the Euphrates, though real enough, had been but minor evils. Rome, placed half way down the long projection of Italy, handicapped by its bad harbours and separated from the rest of the empire by the passes of the Alps, was too far away from the points where the emperor was most wanted--the banks of the Danube and the walls of Sirmium and Singidunum. For the ever-recurring wars with Persia it was even more inconvenient; but these were less pressing dangers; no Persian army had yet penetrated beyond Antioch--only 200 miles from the frontier--while in the Balkan Peninsula the Goths had broken so far into the heart of the empire as to sack Athens and Thessalonica. ', 
              'Constantine, with all the Roman world at his feet, and all its responsibilities weighing on his mind, was far too able a man to overlook the great need of the day--a more conveniently placed administrative and military centre for his empire. He required a place that should be easily accessible by land and sea--which Rome had never been in spite of its wonderful roads--that should overlook the Danube lands, without being too far away from the East; that should be so strongly situated that it might prove an impregnable arsenal and citadel against barbarian attacks from the north; that should at the same time be far enough away from the turmoil of the actual frontier to afford a safe and splendid residence for the imperial court. The names of several towns are given by historians as having suggested themselves to Constantine. First was his own birth-place--Naissus (Nisch) on the Morava, in the heart of the Balkan Peninsula; but Naissus had little to recommend it: it was too close to the frontier and too far from the sea. Sardica--the modern Sofia in Bulgaria--was liable to the same objections, and had not the sole advantage of Naissus, that of being connected in sentiment with the emperors early days. Nicomedia on its long gulf at the east end of the Propontis was a more eligible situation in every way, and had already served as an imperial residence. ', 
              'But all that could be urged in favour of Nicomedia applied with double force to Byzantium, and, in addition, Constantine had no wish to choose a city in which his own memory would be eclipsed by that of his predecessor Diocletian, and whose name was associated by the Christians, the class of his subjects whom he had most favoured of late, with the persecutions of Diocletian and Galerius. For Ilium, the last place on which Constantine had cast his mind, nothing could be alleged except its ancient legendary glories, and the fact that the mythologists of Rome had always fabled that their city drew its origin from the exiled Trojans of AEneas. Though close to the sea it had no good harbour, and it was just too far from the mouth of the Hellespont to command effectually the exit of the Euxine. Byzantium, on the other hand, was thoroughly well known to Constantine. For months his camp had been pitched beneath its walls; he must have known accurately every inch of its environs, and none of its military advantages can have missed his eye. Nothing, then, could have been more natural than his selection of the old Megarian city for his new capital. Yet the Roman world was startled at the first news of his choice; Byzantium had been so long known merely as a great port of call for the Euxine trade, and as a first-class provincial fortress, that it was hard to conceive of it as a destined seat of empire. When once Constantine had determined to make Byzantium his capital, in preference to any other place in the Balkan lands, his measures were taken with his usual energy and thoroughness. The limits of the new city were at once marked out by solemn processions in the old Roman style. ', 
              'In later ages a picturesque legend was told to account for the magnificent scale on which it was planned. The emperor, we read, marched out on foot, followed by all his court, and traced with his spear the line where the new fortifications were to be drawn. As he paced on further and further westward along the shore of the Golden Horn, till he was more than two miles away from his starting-point, the gate of old Byzantium, his attendants grew more and more surprised at the vastness of his scheme. At last they ventured to observe that he had already exceeded the most ample limits that an imperial city could require. But Constantine turned to rebuke them: \"I shall go on,\" he said, \"until He, the invisible guide who marches before me, thinks fit to stop.\" Guided by his mysterious presentiment of greatness, the emperor advanced till he was three miles from the eastern angle of Byzantium, and only turned his steps when he had included in his boundary line all the seven hills which are embraced in the peninsula between the Propontis and the Golden Horn. ', 'The rising ground just outside the walls of the old city, where Constantines tent had been pitched during the siege of A.D. 323, was selected out as the market-place of the new foundation. There he erected the _Milion_, or \"golden milestone,\" from which all the distances of the eastern world were in future to be measured. This \"central point of the world\" was not a mere single stone, but a small building like a temple, its roof supported by seven pillars; within was placed the statue of the emperor, together with that of his venerated mother, the Christian Empress Helena. The south-eastern part of the old town of Byzantium was chosen by Constantine for the site of his imperial palace. The spot was cleared of all private dwellings for a space of 150 acres, to give space not only for a magnificent residence for his whole court, but for spacious gardens and pleasure-grounds. 
A wall, commencing at the Lighthouse, where the Bosphorus joins the Propontis, turned inland and swept along parallel to the shore for about a mile, in order to shut off the imperial precinct from the city. ', 
              'The Heart of Constantinople North-west of the palace lay the central open space in which the life of Constantinople was to find its centre. This was the \"Augustaeum,\" a splendid oblong forum, about a thousand feet long by three hundred broad. It was paved with marble and surrounded on all sides by stately public buildings. To its east, as we have already said, lay the imperial palace, but between the palace and the open space were three detached edifices connected by a colonnade. Of these, the most easterly was the Great Baths, known, from their builder, as the \"Baths of Zeuxippus.\" They were built on the same magnificent scale which the earlier emperors had used in Old Rome, though they could not, perhaps, vie in size with the enormous Baths of Caracalla. Constantine utilized and enlarged the old public bath of Byzantium, which had been rebuilt after the taking of the city by Severus. ', 
              'He adorned the frontage and courts of the edifice with statues taken from every prominent town of Greece and Asia, the old Hellenic masterpieces which had escaped the rapacious hands of twelve generations of plundering proconsuls and Caesars. There were to be seen the Athene of Lyndus, the Amphithrite of Rhodes, the Pan which had been consecrated by the Greeks after the defeat of Xerxes, and the Zeus of Dodona. Adjoining the Baths, to the north, lay the second great building, on the east side of the Augustaeum--the Senate House. Constantine had determined to endow his new city with a senate modelled on that of Old Rome, and had indeed persuaded many old senatorial families to migrate eastward by judicious gifts of pensions and houses. We know that the assembly was worthily housed, but no details survive about Constantines building, on account of its having been twice destroyed within the century. But, like the Baths of Zeuxippus, it was adorned with ancient statuary, among which the Nine Muses of Helicon are specially cited by the historian who describes the burning of the place in A.D. 404. Linked to the Senate House by a colonnade, lay on the north the Palace of the Patriarch, as the Bishop of Byzantium was ere long to be called, when raised to the same status as his brethren of Antioch and Alexandria. ', 
              'A fine building in itself, with a spacious hall of audience and a garden, the patriarchal dwelling was yet completely overshadowed by the imperial palace which rose behind it. And so it was with the patriarch himself: he lived too near his royal master to be able to gain any independent authority. Physically and morally alike he was too much overlooked by his august neighbour, and never found the least opportunity of setting up an independent spiritual authority over against the civil government, or of founding an _imperium in imperio_ like the Bishop of Rome. The Atmeidan Hippodrome And St. Sophia. All along the western side of the Augustaeum, facing the three buildings which we have already described, lay an edifice which played a very prominent part in the public life of Constantinople. This was the great Hippodrome, a splendid circus 640 cubits long and 160 broad, in which were renewed the games that Old Rome had known so well. The whole system the chariot-races between the teams that represented the \"factions\" of the Circus was reproduced at Byzantium with an energy that even surpassed the devotion of the Romans to horse racing. ', 
              'From the first foundation of the city the rivalry of the \"Blues\" and the \"Greens\" was one of the most striking features of the life of the place. It was carried far beyond the circus, and spread into all branches of life. We often hear of the \"Green\" faction identifying itself with Arianism, or of the \"Blue\" supporting a pretender to the throne. Not merely men of sporting interests, but persons of all ranks and professions, chose their colour and backed their faction. The system was a positive danger to the public peace, and constantly led to riots, culminating in the great sedition of A.D. 523, which we shall presently have to describe at length. In the Hippodrome the \"Greens\" always entered by the north-eastern gate, and sat on the east side; the \"Blues\" approached by the north-western gate and stretched along the western side. The emperors box, called the Kathisma, occupied the whole of the short northern side, and contained many hundreds of seats for the imperial retinue. The great central throne of the Kathisma was the place in which the monarch showed himself most frequently to his subjects, and around it many strange scenes were enacted. It was on this throne that the rebel Hypatius was crowned emperor by the mob, with his own wifes necklace for an impromptu diadem. Here also, two centuries later, the Emperor Justinian II. sat in state after his reconquest of Constantinople, with his rivals, Leontius and Apsimarus, bound beneath his footstool, while the populace chanted, in allusion to the names of the vanquished princes, the verse, \"Thou shalt trample on the Lion and the Asp.\" Down the centre of the Hippodrome ran the \"spina,\" or division wall, which every circus showed; it was ornamented with three most curious monuments, whose strange juxtaposition seemed almost to typify the heterogeneous materials from which the new city was built up. ', 
              'The first and oldest was an obelisk brought from Egypt, and covered with the usual hieroglyphic inscriptions; the second was the most notable, though one of the least beautiful, of the antiquities of Constantinople: it was the three-headed brazen serpent which Pausanias and the victorious Greeks had dedicated at Delphi in 479 B.C., after they had destroyed the Persian army at Plataea. The golden tripod, which was supported by the heads of the serpents, had long been wanting: the sacrilegious Phocians had stolen it six centuries before; but the dedicatory inscriptions engraved on the coils of the pedestal survived then and survive now to delight the archaeologist. The third monument on the \"spina\" was a square bronze column of more modern work, contrasting strangely with the venerable antiquity of its neighbours. By some freak of chance all three monuments have remained till our own day: the vast walls of the Hippodrome have crumbled away, but its central decorations still stand erect in the midst of an open space which the Turks call the Atmeidan, or place of horses, in dim memory of its ancient use. Along the outer eastern wall of the Hippodrome on the western edge of the Augustaeum, stood a range of small chapels and statues, the most important landmark among them being the _Milion_ or central milestone of the empire, which we have already described. ', 
              'The statues, few at first, were increased by later emperors, till they extended along the whole length of the forum. Constantines own contribution to the collection was a tall porphyry column surmounted by a bronze image which had once been the tutelary Apollo of the city of Hierapolis, but was turned into a representation of the emperor by the easy method of knocking off its head and substituting the imperial features. It was exactly the reverse of a change which can be seen at Rome, where the popes have removed the head of the Emperor Aurelius, and turned him into St. Peter, on the column in the Corso. Building A Palace (from a Byzantine MS.) North of the Hippodrome stood the great church which Constantine erected for his Christian subjects, and dedicated to the Divine Wisdom (_Hagia Sophia_). It was not the famous domed edifice which now bears that name, but an earlier and humbler building, probably of the Basilica-shape then usual. Burnt down once in the fifth and once in the sixth centuries, it has left no trace of its original character. From the west door of St. Sophia a wooden gallery, supported on arches, crossed the square, and finally ended at the \"Royal Gate\" of the palace. By this the emperor would betake himself to divine service without having to cross the street of the Chalcoprateia (brass market), which lay opposite to St. Sophia. The general effect of the gallery must have been somewhat like that of the curious passage perched aloft on arches which connects the Pitti and Uffizi palaces at Florence. The edifices which we have described formed the heart of Constantinople. Between the Palace, the Hippodrome, and the Cathedral most of the important events in the history of the city took place. But to north and west the city extended for miles, and everywhere there were buildings of note, though no other cluster could vie with that round the Augustaeum. 
The Church of the Holy Apostles, which Constantine destined as the burying-place of his family, was the second among the ecclesiastical edifices of the town. Of the outlying civil buildings, the public granaries along the quays, the Golden Gate, by which the great road from the west entered the walls, and the palace of the praetorian praefect, who acted as governor of the city, must all have been well worthy of notice. ', 
              'A statue of Constantine on horseback, which stood by the last-named edifice, was one of the chief shows of Constantinople down to the end of the Middle Ages, and some curious legends gathered around it. Fifteenth-Century Drawing Of The Equestrian Statue Of Constantine. It was in A.D. 328 or 329--the exact date is not easily to be fixed--that Constantine had definitely chosen Byzantium for his capital, and drawn out the plan for its development. As early as May 11, 330, the buildings were so far advanced that he was able to hold the festival which celebrated its consecration. Christian bishops blessed the partially completed palace, and held the first service in St. Sophia; for Constantine, though still unbaptized himself, had determined that the new city should be Christian from the first. Of paganism there was no trace in it, save a few of the old temples of the Byzantines, spared when the older streets were levelled to clear the ground for the palace and adjoining buildings. The statues of the gods which adorned the Baths and Senate House stood there as works of art, not as objects of worship. To fill the vast limits of his city, Constantine invited many senators of Old Rome and many rich provincial proprietors of Greece and Asia to take up their abode in it, granting them places in his new senate and sites for the dwellings they would require. The countless officers and functionaries of the imperial court, with their subordinates and slaves, must have composed a very considerable element in the new population. The artizans and handicraftsmen were enticed in thousands by the offer of special privileges. Merchants and seamen had always abounded at Byzantium, and now flocked in numbers which made the old commercial prosperity of the city seem insignificant. Most effective--though most demoralizing--of the gifts which Constantine bestowed on the new capital to attract immigrants was the old Roman privilege of free distribution of corn to the populace. 
The wheat-tribute of Egypt, which had previously formed part of the public provision of Rome, was transferred to the use of Constantinople, only the African corn from Carthage being for the future assigned for the subsistence of the older city. On the completion of the dedication festival in 330 A.D. an imperial edict gave the city the title of New Rome, and the record was placed on a marble tablet near the equestrian statue of the emperor, opposite the Strategion. But \"New Rome\" was a phrase destined to subsist in poetry and rhetoric alone: the world from the first very rightly gave the city the founders name only, and persisted in calling it Constantinople. ')

chapters <- tibble::tibble(chapter=1:2, text=c(ch1, ch2))
str(chapters)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
##  $ chapter: int  1 2
##  $ text   : chr  "Two thousand five hundred and fifty-eight years ago a little fleet of galleys toiled painfully against the curr"| __truncated__ "When the fall of Byzantium had wrecked the fortunes of Licinius, the Roman world was again united beneath the s"| __truncated__
# Specify the input column
word_freq <- chapters %>% 
    tidytext::unnest_tokens(output=word, input=text, token="words", format="text") %>% 
    # Obtain word frequencies
    count(chapter, word) 

# Test equality
word_freq %>% 
    filter(word == "after")
## # A tibble: 2 x 3
##   chapter word      n
##     <int> <chr> <int>
## 1       1 after    11
## 2       2 after     4
corpus <- data.frame(text=c('Due to bad loans, the bank agreed to pay the fines', 'If you are late to pay off your loans to the bank, you will face fines', 'A new restaurant opened in downtown', 'There is a new restaurant that just opened on Warwick street', 'How will you pay off the loans you will need for the restaurant you want opened?'), 
                     id=paste0("id_", 1:5), 
                     stringsAsFactors = FALSE
                     )
corpus
##                                                                               text
## 1                               Due to bad loans, the bank agreed to pay the fines
## 2           If you are late to pay off your loans to the bank, you will face fines
## 3                                              A new restaurant opened in downtown
## 4                     There is a new restaurant that just opened on Warwick street
## 5 How will you pay off the loans you will need for the restaurant you want opened?
##     id
## 1 id_1
## 2 id_2
## 3 id_3
## 4 id_4
## 5 id_5
# The call to posterior(mod)$topics returns the probabilities of topics.
dtm <- corpus %>% 
    # Specify the input column
    tidytext::unnest_tokens(input=text, output=word, drop=TRUE) %>% 
    count(id, word) %>% 
    # Specify the token
    tidytext::cast_dtm(document=id, term=word, value=n)

mod = topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=1, delta=0.1, seed=10005))

modeltools::posterior(mod)$topics
##          1     2
## id_1 0.615 0.385
## id_2 0.444 0.556
## id_3 0.250 0.750
## id_4 0.615 0.385
## id_5 0.111 0.889
# Generate the document-term matrix
dtm <- corpus %>% 
    tidytext::unnest_tokens(input=text, output=word) %>% 
    count(id, word) %>% 
    tidytext::cast_dtm(document=id, term=word, value=n)

# Run the LDA for two topics
mod <- topicmodels::LDA(x=dtm, k=2, method="Gibbs",control=list(alpha=1, delta=0.1, seed=10005))

# Retrieve the probabilities of word `will` belonging to topics 1 and 2
broom::tidy(mod, matrix="beta") %>%
    filter(term == "will")
## # A tibble: 2 x 3
##   topic term     beta
##   <int> <chr>   <dbl>
## 1     1 will  0.00379
## 2     2 will  0.0767
# Make a stacked column chart showing the probabilities of documents belonging to topics
broom::tidy(mod, matrix="gamma") %>% 
    mutate(topic = as.factor(topic)) %>% 
    ggplot(aes(x=document, y=gamma)) + 
    geom_col(aes(fill=topic))


Chapter 2 - Word Clouds, Stop Words, Control Arguments

Random Nature of LDA Algorithm:

  • The LDA call has a random component - random search through the parameter space to find the best match (uses log-likelihood)
    • Gibbs sampling - a type of Monte Carlo Markov Chain (MCMC) algorithm
    • Tries different combinations of probabilities of topics in documents, and probabilities of words in topics: e.g. (0.5, 0.5) vs. (0.8, 0.2)
    • The combinations are influenced by parameters alpha and delta
    • Argument seed sets the starting point for the pseudo-random number generator
    • Setting seeds ensures replication of results between runs
    • Argument iter controls the number of iterations of the algorithm - more iterations take longer but improve the chance of finding the best solution
  • Intermediate results can be managed using control parameters for C code that is called by topicmodels
    • topicmodels calls a piece of code written in C
    • Argument thin specifies how often to return the result of search
    • control=list(thin=1)
    • Setting thin=1 will return the result of every iteration, and the best one will be picked
    • Most thorough search, but slows down the execution
  • Approach for getting the most probable words in each topic
    • LDA model object contains matrix beta with probabilities of words in topics
    • Use function tidy to extract
    • If we want to get the top 3 words from each topic:
    • Retrieve the matrix by calling tidy(model, matrix="beta"), sort by probabilities, and filter by row number
    • tidy(mod, matrix="beta") %>% group_by(topic) %>% arrange(desc(beta)) %>% filter(row_number() <= 3) %>% ungroup() %>% arrange(topic, desc(beta))
  • Can get the top-k words (or words above a threshold) using the function terms()
    • Function terms from topicmodels will return either top k words or all words with probability above threshold
    • terms(mod, k=5)
    • terms(mod, threshold=0.05)
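
The top-words pipeline above can be sketched on a toy beta table. The topics, terms, and beta values below are invented for illustration; a real table would come from tidy(mod, matrix="beta"):

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical tidy beta table (term probabilities per topic)
beta_table <- data.frame(
    topic = c(1, 1, 1, 2, 2, 2),
    term  = c("bank", "loans", "pay", "restaurant", "opened", "new"),
    beta  = c(0.12, 0.10, 0.08, 0.15, 0.11, 0.07)
)

# Top 2 words per topic: sort globally by beta, then keep the
# first rows within each topic group
top_words <- beta_table %>%
    group_by(topic) %>%
    arrange(desc(beta)) %>%
    filter(row_number() <= 2) %>%
    ungroup() %>%
    arrange(topic, desc(beta))

top_words$term
# "bank" "loans" "restaurant" "opened"
```

Note that arrange() ignores grouping by default, but row_number() inside filter() is evaluated per group, which is what makes this idiom work.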

Manipulating Vocabulary:

  • Often there are either words we want to keep or words we want to discard (stop words)
    • Stop words are common function words that are treated as noise and should be removed, since they obscure word associations in topics
  • Can use anti_join() for filtering out stop words
    • tidytext comes with a table stop_words containing stop words from several lexicons
    • d = data.frame(term=c("we", "went", "fishing", "slept"), count=c(2, 1, 3, 1), stringsAsFactors = F)
    • d %>% anti_join(stop_words, by=c("term"="word"))
  • Can instead keep a targeted list of needed words using inner_join()
    • d = data.frame(term=c("we", "went", "fishing", "slept"), count=c(2, 1, 3, 1), stringsAsFactors = F)
    • dictionary = data.frame(term=c("fishing", "slept"), stringsAsFactors = F)
    • d %>% inner_join(dictionary, by="term")
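
The anti_join / inner_join bullets above can be run as-is on a small toy table; stops below stands in for tidytext::stop_words, which stores its words in a column named word:

```r
suppressPackageStartupMessages(library(dplyr))

d <- data.frame(term = c("we", "went", "fishing", "slept"),
                count = c(2, 1, 3, 1),
                stringsAsFactors = FALSE)

# Stand-in for tidytext::stop_words (words live in column `word`)
stops <- data.frame(word = c("we", "went"), stringsAsFactors = FALSE)

# Drop stop words: keep rows of d with no match in stops
kept <- d %>% anti_join(stops, by = c("term" = "word"))

# Keep only a targeted dictionary of words
dictionary <- data.frame(term = c("fishing", "slept"), stringsAsFactors = FALSE)
targeted <- d %>% inner_join(dictionary, by = "term")

kept$term      # "fishing" "slept"
targeted$term  # "fishing" "slept"
```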

Word Clouds:

  • Word clouds can be better than bar charts when there are a large number of words
    • wordcloud will draw a cloud of text labels, with font size proportionate to frequency of the word
    • Required arguments - a vector of words, and the vector of word frequencies
    • No need to sort the words by frequency
    • word_frequencies <- corpus %>% unnest_tokens(input=text, output=word) %>% count(word)
  • The wordcloud::wordcloud() function takes word frequencies and several key arguments
    • Specify the maximum number of words shown with max.words
    • Specify the minimum word frequency shown with min.freq
    • wordcloud::wordcloud(words=word_frequencies$word, freq=word_frequencies$n, min.freq=1, max.words=20)
  • Can further add colors and rotations to the wordcloud for an improved aesthetic
    • colors takes a vector of colors
    • rot.per is the proportion of rotated words. Default is 0.1
    • wordcloud::wordcloud(words=word_frequencies$word, freq=word_frequencies$n, min.freq=1, colors=c("DarkOrange", "CornflowerBlue", "DarkRed"), rot.per=0.3, max.words=20)
  • Need to convert LDA data (percentages) to data formatted as expected by wordcloud (integers)
    • wordcloud expects integer values for word frequencies
    • LDA returns probabilities - decimal fractions
    • Solution: multiply by a large number, truncate the fractional part
    • mod <- LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=1, thin=1, seed=10005))
    • word_frequencies <- tidy(mod, matrix="beta") %>% mutate(n = trunc(beta * 10000)) %>% filter(topic == 1)
    • wordcloud::wordcloud(words=word_frequencies$term, freq=word_frequencies$n, max.words=20, colors=c("DarkOrange", "CornflowerBlue", "DarkRed"), rot.per=0.3)
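
The probability-to-count conversion is just scale-and-truncate; the beta values below are illustrative, not from a fitted model:

```r
# LDA betas are decimal fractions; wordcloud() expects integer frequencies.
# Multiply by a large constant and drop the fractional part.
beta <- c(will = 0.00379, bank = 0.0767, loans = 0.0512)
n <- trunc(beta * 10000)
n[["will"]]  # 37
```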

History of the Byzantine Empire:

  • The Byzantine Empire (East Rome) existed from 330 CE to 1453 CE - capital was in Constantinople
    • The text: The Byzantine Empire, by Charles Oman, printed in 1902, available from Project Gutenberg (https://www.gutenberg.org/)
    • Twenty-six chapters arranged in chronological order
    • Package gutenbergr enables direct download of texts - dataframe with lines of text
    • Dataframe history with two columns: text and chapter
  • Goal is to find predominant themes in each of the periods, then compare with domain expertise and plot using ggplot2

Example code includes:

dtm <- dtm[, c("bank", "fines", "loans", "pay", "new", "opened", "restaurant")]
dtm
## <<DocumentTermMatrix (documents: 5, terms: 7)>>
## Non-/sparse entries: 18/17
## Sparsity           : 49%
## Maximal term length: 10
## Weighting          : term frequency (tf)
# Display column names
colnames(dtm)
## [1] "bank"       "fines"      "loans"      "pay"        "new"       
## [6] "opened"     "restaurant"
# Fit an LDA model for 2 topics using Gibbs sampling
mod <- topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=1, seed=10005, thin=1))

# Convert matrix beta into tidy format and filter on topic number and term
broom::tidy(mod, matrix="beta") %>%
    filter(topic==2, term=="opened")
## # A tibble: 1 x 3
##   topic term    beta
##   <int> <chr>  <dbl>
## 1     2 opened 0.320
# The call to posterior(mod)$topics returns the probabilities of topics.
dtm <- corpus %>% 
    # Specify the input column
    tidytext::unnest_tokens(input=text, output=word, drop=TRUE) %>% 
    count(id, word) %>% 
    # Specify the token
    tidytext::cast_dtm(document=id, term=word, value=n)


# Fit LDA topic model using Gibbs sampling for 2 topics
mod1 <- topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=1, seed=10005, thin=1))

# Display the probabilities of topics in documents side by side
broom::tidy(mod1, "gamma") %>% spread(topic, gamma)
## # A tibble: 5 x 3
##   document   `1`    `2`
##   <chr>    <dbl>  <dbl>
## 1 id_1     0.308 0.692 
## 2 id_2     0.278 0.722 
## 3 id_3     0.875 0.125 
## 4 id_4     0.923 0.0769
## 5 id_5     0.389 0.611
# Fit LDA topic model using Gibbs sampling for 2 topics
mod2 <- topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=25, seed=10005, thin=1))

# Display the probabilities of topics in documents side by side
broom::tidy(mod2, "gamma") %>% spread(topic, gamma)
## # A tibble: 5 x 3
##   document   `1`   `2`
##   <chr>    <dbl> <dbl>
## 1 id_1     0.443 0.557
## 2 id_2     0.5   0.5  
## 3 id_3     0.518 0.482
## 4 id_4     0.557 0.443
## 5 id_5     0.485 0.515
# Create the document-term matrix
dtm <- corpus %>%
    tidytext::unnest_tokens(output=word, input=text) %>%
    count(id, word) %>%
    tidytext::cast_dtm(document=id, term=word, value=n)

# Display dtm as a matrix
as.matrix(dtm)
##       Terms
## Docs   agreed bad bank due fines loans pay the to are face if late off
##   id_1      1   1    1   1     1     1   1   2  2   0    0  0    0   0
##   id_2      0   0    1   0     1     1   1   1  2   1    1  1    1   1
##   id_3      0   0    0   0     0     0   0   0  0   0    0  0    0   0
##   id_4      0   0    0   0     0     0   0   0  0   0    0  0    0   0
##   id_5      0   0    0   0     0     1   1   2  0   0    0  0    0   1
##       Terms
## Docs   will you your a downtown in new opened restaurant is just on street
##   id_1    0   0    0 0        0  0   0      0          0  0    0  0      0
##   id_2    1   2    1 0        0  0   0      0          0  0    0  0      0
##   id_3    0   0    0 1        1  1   1      1          1  0    0  0      0
##   id_4    0   0    0 1        0  0   1      1          1  1    1  1      1
##   id_5    2   3    0 0        0  0   0      1          1  0    0  0      0
##       Terms
## Docs   that there warwick for how need want
##   id_1    0     0       0   0   0    0    0
##   id_2    0     0       0   0   0    0    0
##   id_3    0     0       0   0   0    0    0
##   id_4    1     1       1   0   0    0    0
##   id_5    0     0       0   1   1    1    1
# Create the document-term matrix with stop words removed
dtm <- corpus %>%
    tidytext::unnest_tokens(output=word, input=text) %>%
    anti_join(tidytext::stop_words) %>% 
    count(id, word) %>%
    tidytext::cast_dtm(document=id, term=word, value=n)
## Joining, by = "word"
# Display the matrix
as.matrix(dtm)
##       Terms
## Docs   agreed bad bank due fines loans pay late downtown restaurant street
##   id_1      1   1    1   1     1     1   1    0        0          0      0
##   id_2      0   0    1   0     1     1   1    1        0          0      0
##   id_3      0   0    0   0     0     0   0    0        1          1      0
##   id_4      0   0    0   0     0     0   0    0        0          1      1
##   id_5      0   0    0   0     0     1   1    0        0          1      0
##       Terms
## Docs   warwick
##   id_1       0
##   id_2       0
##   id_3       0
##   id_4       1
##   id_5       0
dictionary <- tibble::tibble(word=c("bank", "fines", "loans", "pay", "new", "opened", "restaurant"))
dictionary
## # A tibble: 7 x 1
##   word      
##   <chr>     
## 1 bank      
## 2 fines     
## 3 loans     
## 4 pay       
## 5 new       
## 6 opened    
## 7 restaurant
# Perform inner_join with the dictionary table
dtm <- corpus %>%
    tidytext::unnest_tokens(output=word, input=text) %>%
    inner_join(dictionary) %>%
    count(id, word) %>%
    tidytext::cast_dtm(document=id, term=word, value=n)
## Joining, by = "word"
# Display the summary of dtm
as.matrix(dtm)
##       Terms
## Docs   bank fines loans pay new opened restaurant
##   id_1    1     1     1   1   0      0          0
##   id_2    1     1     1   1   0      0          0
##   id_3    0     0     0   0   1      1          1
##   id_4    0     0     0   0   1      1          1
##   id_5    0     0     1   1   0      1          1
# Generate the counts of words in the corpus
word_frequencies <- corpus %>% 
    tidytext::unnest_tokens(input=text, output=word) %>%
    count(word)

# Create a wordcloud
wordcloud::wordcloud(words=word_frequencies$word, freq=word_frequencies$n, min.freq=1, max.words=10, 
                     colors=c("DarkOrange", "Blue"), random.order=FALSE, random.color=FALSE
                     )

# DO NOT HAVE FILE 'history'
# Construct a document-term matrix
# dtm <- history %>% 
# #     tidytext::unnest_tokens(input=text, output=word) %>%
#     anti_join(stop_words) %>% 
#     count(chapter, word) %>% 
#     tidytext::cast_dtm(document=chapter, term=word, value=n)
# 
# # Insert the missing arguments
# mod <- topicmodels::LDA(x=dtm, k=4, method="Gibbs", control=list(alpha=1, seed=10005))
# 
# # Display top 15 words of each topic
# terms(mod, k=15)
# 
# 
# # Display the structure of the verbs dataframe
# str(verbs)
# 
# # Construct a document-term matrix
# dtm <- history %>% 
# #     tidytext::unnest_tokens(input=text, output=word) %>%
#     inner_join(verbs, by=c("word"="past")) %>% 
#     count(chapter, word) %>% 
#     tidytext::cast_dtm(document=chapter, term=word, value=n)
# 
# # Fit LDA for four topics
# mod <- topicmodels::LDA(x=dtm, k=4, method="Gibbs", control=list(alpha=1, seed=10005))
# 
# # Display top 25 words from each topic
# terms(mod, k=25)
# 
# 
# # Extract matrix gamma and plot it
# broom::tidy(mod, "gamma") %>% 
#     mutate(document=as.numeric(document)) %>% 
#     ggplot(aes(x=document, y=gamma)) + 
#     geom_line(aes(color=factor(topic))) + 
#     labs(x="Chapter", y="Topic probability") +
# #     scale_color_manual(values=RColorBrewer::brewer.pal(n=4, "Set1"), name="Topic")
# 
# 
# 
# # Display wordclouds one at a time
# for (j in 1:4) {
#     # Generate a table with word frequences for topic j
#     word_frequencies <- broom::tidy(mod, matrix="beta") %>% 
#         mutate(n = trunc(beta * 10000)) %>% 
#         filter(topic == j)
#     # Display word cloud
# #     wordcloud::wordcloud(words = word_frequencies$term, freq = word_frequencies$n, max.words = 20,
#                          scale = c(3, 0.5), colors = c("DarkOrange", "CornflowerBlue", "DarkRed"), 
#                          rot.per = 0.3
#                          )
# }

Chapter 3 - Named Entity Recognition as Unsupervised Classification

Using Topic Models as Classifiers:

  • Can use a topic model as a soft classifier - find probability of belonging to a specific class
    • Named entity recognition - is an entity a geographic name or a person?
    • “Washington crossed the Delaware river” vs. “They did a road trip across Washington”
  • The control parameters play an important role
    • A very high alpha may drive 50-50 splits everywhere - alpha greater than 1 puts most of the distribution in the center (even topic mixtures), less than 1 puts most of the distribution at the edges
  • Bag of M&Ms as an example for how an LDA fits a model - multinomial distribution
    • Several outcomes (colors of candy) repeated n times
    • Each outcome has its own probability - fixed by the factory that filled the bag
    • Probabilities sum up to 1
    • What’s the probability of getting 5 yellow, 2 brown, 2 blue, and 1 black when we take out 10 pieces of candy?
    • In the LDA model, topics are the colors, and there are two “bags of candy”: one for documents, one for words
  • The Dirichlet aspect of LDA
    • Randomized search for the best values of probabilities
    • E.g., try 0.2, 0.5 and 0.3 for proportions of three topics
    • Hard to do for large number of topics and words
    • Instead, use values from Dirichlet distribution as guesses
    • Dirichlet distribution returns a set of numbers that add up to 1 - serve as probabilities of colors for M&M candy bags
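
The M&M question above can be computed directly with base R's dmultinom(); the per-color probabilities are made up for illustration:

```r
# Probability of drawing 5 yellow, 2 brown, 2 blue, 1 black in 10 draws,
# given hypothetical per-color probabilities "fixed by the factory"
counts <- c(yellow = 5, brown = 2, blue = 2, black = 1)
probs  <- c(yellow = 0.4, brown = 0.3, blue = 0.2, black = 0.1)
p <- dmultinom(x = counts, prob = probs)
round(p, 4)  # 0.0279
```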

From Word Windows to DTM:

  • An entity is a proper noun - often capitalized
    • If we take n words on the left and n words on the right of an entity, we get a word window
    • E.g., attention of Megara was turned - entity is Megara and context is attention was turned
    • Also possible to tag words to differentiate the side - attention_L1 of_L2 was_R1 turned_R2
  • Can combine the contexts of the same entity using paste to create a document
    • docs <- df %>% group_by(entity) %>% summarise(doc_id = first(entity), text = paste(text, collapse=" "))
  • Can use regex to find entities in a document - take advantage of entities being capitalized
    • pattern <- "[A-Z][a-z]+"
    • m <- gregexpr(pattern, text)
    • entities <- unlist(regmatches(text, m))
  • Can also run regular expressions using parentheses to group some patterns together
    • Some entities include St. in them, e.g. St. Sophia
    • p <- "(St[.] )?[A-Z][a-z]+"
    • (St[.] ) is a group. The ? quantifier means the group is optional
  • Can also use capture groups to add a suffix
    • t <- "the great Darius threw across"
    • gsub("^([a-z]+) ([a-z]+)", "\\1_L1 \\2_L2", t) # result is "the_L1 great_L2 Darius threw across"
    • Two groups, each matches a lowercase word [a-z]+
    • The ^ is an anchor - it specifies a position in the string: ^ matches the start, $ the end
    • The \\1 is a back-reference to the contents of group 1; its contents are substituted
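
The regex steps above can be exercised on the same toy sentence with base R alone:

```r
t <- "the great Darius threw across"

# Optional "St. " prefix followed by a capitalized word
p_entity <- "(St[.] )?[A-Z][a-z]+"
m <- gregexpr(p_entity, t)
unlist(regmatches(t, m))  # "Darius"

# Tag the two words before the entity using capture groups and back-references
gsub("^([a-z]+) ([a-z]+)", "\\1_L1 \\2_L2", t)
# "the_L1 great_L2 Darius threw across"
```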

Corpus Alignment and Classification:

  • Can also run models for unsupervised classification - telling the meaning of a named entity
    • topics <- tidy(mod, matrix="gamma") %>% spread(topic, gamma)
    • topics %>% filter(document %in% c(" Alboin", " Alexander", " Asia Minor", " Amorium", " Cappadocian"))
  • Ideally, want to use a pre-trained model, as is possible using topicmodels::posterior()
    • model = LDA(…)
    • result = posterior(model, new_data)
    • result$topics
    • new_data must be aligned with the vocabulary used in the model
    • LDA algorithm iterates over items and their counts, does not “know” that it’s dealing with words
  • Corpus alignment involves ensuring the training and test set have the same vocabularies and rules
    • Drop words from dtm that are not part of model’s vocabulary
    • Function tidy with matrix="beta" extracts the terms and their probabilities
    • model_vocab <- tidy(mod, matrix="beta") %>% select(term) %>% distinct()
    • Do a right join with the model's vocabulary to keep only the words the model was trained on
    • new_table <- new_doc %>% unnest_tokens(input=text, output=word) %>% count(doc_id, word) %>% right_join(model_vocab, by=c("word"="term"))
  • Can have challenges with NAs introduced in the process - as document names (set to the first non-missing doc_id) and as counts (set to 0)
    • new_dtm <- new_table %>% arrange(desc(doc_id)) %>% mutate(doc_id = ifelse(is.na(doc_id), first(doc_id), doc_id), n = ifelse(is.na(n), 0, n)) %>% cast_dtm(document=doc_id, term=word, value=n)
  • Hold-out data is valuable, similar to any other machine learning process
    • Held-out data for testing
    • Hold out a percentage of full records (same as with test datasets in ML)
    • Hold out a percentage of terms inside a document (unique to topic modeling)
    • Estimate quality of fit by looking at the log-likelihood
    • held-out log-likelihood
    • Our case: withhold full documents, no cross-validation
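
A minimal sketch of the alignment steps, using a hypothetical three-word model vocabulary and a new document that contains one in-vocabulary and one out-of-vocabulary word:

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical vocabulary extracted from a fitted model's beta matrix
model_vocab <- data.frame(term = c("bank", "loans", "pay"),
                          stringsAsFactors = FALSE)

# Counts for a new document; "bitcoin" was never seen by the model
new_table <- data.frame(doc_id = c("d1", "d1"),
                        word   = c("bank", "bitcoin"),
                        n      = c(2L, 1L),
                        stringsAsFactors = FALSE)

# Keep only model words; model terms absent from the document come back
# as NA rows, so fill in the document name and zero counts
aligned <- new_table %>%
    right_join(model_vocab, by = c("word" = "term")) %>%
    mutate(doc_id = ifelse(is.na(doc_id), first(na.omit(doc_id)), doc_id),
           n = ifelse(is.na(n), 0L, n))

sort(aligned$word)  # "bank" "loans" "pay"  ("bitcoin" is dropped)
sum(aligned$n)      # 2
```

The resulting table has a row for every model term, which is what cast_dtm() needs to build a dtm whose columns line up with the trained model.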

Example code includes:

# The call to posterior(mod)$topics returns the probabilities of topics.
dtm <- corpus %>% 
    # Specify the input column
    tidytext::unnest_tokens(input=text, output=word, drop=TRUE) %>% 
    count(id, word) %>% 
    # Specify the token
    tidytext::cast_dtm(document=id, term=word, value=n)


# Fit a topic model using LDA with Gibbs sampling
mod = topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(iter=500, thin=1, seed = 12345, alpha=NULL))

# Display topic prevalance in documents as a table
broom::tidy(mod, "gamma") %>% spread(topic, gamma)  
## # A tibble: 5 x 3
##   document   `1`   `2`
##   <chr>    <dbl> <dbl>
## 1 id_1     0.475 0.525
## 2 id_2     0.485 0.515
## 3 id_3     0.536 0.464
## 4 id_4     0.525 0.475
## 5 id_5     0.485 0.515
# Fit the model for delta = 0.1
mod <- topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(iter=500, seed=12345, alpha=1, delta=0.1))

# Define which words we want to examine
my_terms = c("loans", "bank", "opened", "pay", "restaurant", "you")

# Make a tidy table
t <- broom::tidy(mod, "beta") %>% 
    filter(term %in% my_terms)

# Make a stacked column chart of word probabilities
ggplot(t, aes(x=term, y=beta)) + 
    geom_col(aes(fill=factor(topic))) +
    theme(axis.text.x=element_text(angle=90))

# Fit the model for delta = 0.5
mod <- topicmodels::LDA(x=dtm, k=2, method="Gibbs", control=list(iter=500, seed=12345, alpha=1, delta=0.5))

# Define which words we want to examine
my_terms = c("loans", "bank", "opened", "pay", "restaurant", "you")

# Make a tidy table
t <- broom::tidy(mod, "beta") %>% 
    filter(term %in% my_terms)

# Make a stacked column chart
ggplot(t, aes(x=term, y=beta)) + 
    geom_col(aes(fill=factor(topic))) +
    theme(axis.text.x=element_text(angle=90))

text <- c(ch1, ch2)

# Regex pattern for an entity and word context
p1 <- "( [a-z]+){2}( (St[.] )?[A-Z][a-z]+)+( [a-z]+){2}"

# Obtain the regex match object from gregexpr
m <- gregexpr(p1, text)

# Get the matches and flatten the list
v <- unlist(regmatches(text, m))

# Find the number of elements in the vector
length(v)
## [1] 116
# Regex pattern for an entity and word context
p2 <- "( [a-z]+){2}( (St[.] )?[A-Z][a-z]+( (of|the) [A-Z][a-z]+)?)+( [a-z]+){2}"

# Obtain the regex match object from gregexpr
m <- gregexpr(p2, text)

# Get the matches and flatten the list
v <- unlist(regmatches(text, m))

# Find the number of elements in the vector
length(v)
## [1] 121
entity_pattern <- "( (St[.] )?[A-Z][a-z]+( (of|the) [A-Z][a-z]+)?)+"
v <- c(' into the European shore of', ' settlers were Greeks of the', ' cities of Hellas in the', ' to the West to plant', ' attention of Megara was turned', ' of the Black Sea and the', ' behind the North Wind and know', ' wonders the Greeks sailed ever', ' of the Golden Fleece they did', ' metals of Colchis and the', ' of the Bosphorus and the', ' foundation of Byzantium was but', ' years before Byzantium came into', ' band of Megarian colonists had', ' the opposite Asiatic shore of', ' oracle of Delphi to give', ' that the Chalcedonians were truly', ' less inviting Bithynian side of', ' marked out Byzantium as destined', ' with all Europe behind it', ' early history Byzantium never fell', ' by the Danube mouth or', ' a hundred Hellenic towns on', ' house alone Byzantium would have', ' independent state Byzantium had a', ' the great Darius threw across', ' his son Xerxes crossed the', ' down the Bosphorus to pay', ' from the Oriental yoke seventeen', ' was at Byzantium that the', ' empire of Athens were laid', ' all the Greek states of', ' fifth century Byzantium twice declared', ' and the Byzantines escaped anything', ' blessing gave Byzantium its chief', ' wines of Maronea and other', ' that the Byzantines were eating', ' rise of Philip of Macedon and his', ' by the Byzantines as a', ' after repulsing Philip the Byzantines had to', ' power of Rome invaded the', ' regions of Thrace and the', ' that the Senate gave it', ' till the Roman Republic had long', ' that the Emperor Vespasian stripped it', ' for itself Byzantium lay on', ' and the Syrian emperor put', ' garrison of Byzantium refused to', ' magistrates of Byzantium were slain', ' to the Byzantines the right', ' on the Black Sea whose commerce', ' the old Megarian race who', ' we find Byzantium again a', ' the historian Trebellius Pollio for the', ' repelled a Gothic raid in', ' from the Roman Empire the ruin', ' on the Bithynian side of', ' importance of Byzantium was always', ' abdication of Diocletian the empire', ' and took Byzantium by surprise')

# Print out contents of the `entity_pattern`
entity_pattern
## [1] "( (St[.] )?[A-Z][a-z]+( (of|the) [A-Z][a-z]+)?)+"
# Remove the named entity from text
v2 <- gsub(entity_pattern, "", v)

# Display the head of v2
head(v2)
## [1] " into the shore of"       " settlers were of the"   
## [3] " cities of in the"        " to the to plant"        
## [5] " attention of was turned" " of the and the"

# Pattern for inserting suffixes
p <- "\\1_L1 \\2_L2 \\3_R1 \\4_R2"

# Add suffixes to words
context <- gsub("([a-z]+) ([a-z]+) ([a-z]+) ([a-z]+)", p, v2)

# Extract named entity and use it as document ID
doc_id <- unlist(regmatches(v, gregexpr(entity_pattern, v)))

# Make a data frame with columns doc_id and text
corpus <- data.frame(doc_id = doc_id, text = context, stringsAsFactors = F)


# Summarize the text to produce a document for each doc_id
corpus2 <- corpus %>% 
    group_by(doc_id) %>% 
    summarise(doc = paste(text, collapse=" "))

# Make a document-term matrix
dtm <- corpus2 %>% 
    tidytext::unnest_tokens(input=doc, output=word) %>% 
    count(doc_id, word) %>%
    tidytext::cast_dtm(document=doc_id, term=word, value=n)

# Fit an LDA model for 3 topics
mod <- topicmodels::LDA(x=dtm, k=3, method="Gibbs", control=list(alpha=1, seed=12345, iter=1000, thin=1))

# Create a table with probabilities of topics in documents
topics <- broom::tidy(mod, matrix="gamma") %>% 
    spread(topic, gamma)


# Set random seed for reproducibility
set.seed(12345)

# Take a sample of 20 random integers, without replacement
r <- sample.int(n=nrow(corpus2), size=20, replace=FALSE)

# Generate a document-term matrix
train_dtm <- corpus2[-r, ] %>% 
    tidytext::unnest_tokens(input=doc, output=word) %>% 
    count(doc_id, word) %>% 
    tidytext::cast_dtm(document=doc_id, term=word, value=n)

# Fit an LDA topic model for k=3
train_mod <- topicmodels::LDA(x=train_dtm, k=3, method="Gibbs", 
                              control=list(alpha=1, seed=10001, iter=1000, thin=1)
                              )


# Get the test row indices
set.seed(12345)
r <- sample.int(n=nrow(corpus2), size=20, replace=FALSE)

# Extract the vocabulary of the training model
model_vocab <- broom::tidy(train_mod, matrix="beta") %>% 
    select(term) %>% 
    distinct()

# Create a table of counts with aligned vocabularies
test_table <- corpus2[r, ] %>% 
    tidytext::unnest_tokens(input=doc, output=word) %>% 
    count(doc_id, word) %>%
    right_join(model_vocab, by=c("word"="term"))

# Prepare a document-term matrix
test_dtm <- test_table %>% 
    arrange(desc(doc_id)) %>% 
    mutate(doc_id = ifelse(is.na(doc_id), first(doc_id), doc_id), n = ifelse(is.na(n), 0, n)) %>% 
    tidytext::cast_dtm(document=doc_id, term=word, value=n)


# Obtain posterior probabilities for test documents
results <- modeltools::posterior(object=train_mod, newdata=test_dtm)

# Display the matrix with topic probabilities
results$topics
##                            1     2      3
##  Thrace                0.500 0.333 0.1667
##  Syrian                0.600 0.200 0.2000
##  Senate                0.400 0.200 0.4000
##  Roman Republic        0.600 0.200 0.2000
##  Roman Empire          0.500 0.250 0.2500
##  Philip the Byzantines 0.400 0.400 0.2000
##  Oriental              0.500 0.250 0.2500
##  Megarian              0.200 0.400 0.4000
##  Maronea               0.400 0.400 0.2000
##  Hellenic              0.250 0.250 0.5000
##  Golden Fleece         0.400 0.200 0.4000
##  European              0.167 0.667 0.1667
##  Emperor Vespasian     0.400 0.200 0.4000
##  Diocletian            0.250 0.500 0.2500
##  Danube                0.500 0.250 0.2500
##  Colchis               0.500 0.333 0.1667
##  Chalcedonians         0.400 0.400 0.2000
##  Byzantines            0.750 0.167 0.0833
##  Black Sea             0.750 0.125 0.1250
##  Bithynian             0.167 0.667 0.1667

Chapter 4 - How Many Topics is Enough?

Finding the Best Number of Topics:

  • Several approaches to picking the optimal number of topics
    • Topic coherence - examine the words in a topic and decide whether they make sense together (e.g., site, settlement, excavation cohere, while adding popsicle signals low coherence)
    • Quantitative measures
    • Log-likelihood - how plausible model parameters are given the data
    • Perplexity - model’s “surprise” at the data
  • The log-likelihood measures the plausibility of the parameters given the data - logs allow for summing rather than multiplying
    • Likelihood - measure of how plausible model parameters are given the data
    • Taking a logarithm makes calculations easier
    • All values are negative because likelihoods are probabilities below 1, and log(x) < 0 when x < 1
    • Numerical optimization - search for the largest log-likelihood (-100 is better than -105)
    • Function logLik returns log-likelihood of an LDA model
  • Perplexity is a measure of the model’s surprise with the data - positive number, smaller is better
    • perplexity(object=mod, newdata=dtm)
  • Can run results for multiple values of k, then plot and find optimal performance on log-likelihood and perplexity
    • Similar to clustering, the goal is to find an elbow point
    • Searching for best k can take a lot of time
    • Factors: number of documents, number of terms, and number of iterations
    • Model fitting can be resumed
    • Function LDA accepts an LDA model as an object for initialization
    • mod = LDA(x=dtm, method="Gibbs", k=4, control=list(alpha=0.5, seed=12345, iter=1000, keep=1))
    • mod2 = LDA(x=dtm, model=mod, control=list(thin=1, seed=10000, iter=200))
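
The k-scan described above can be sketched end to end. The tiny corpus below is hypothetical (not from the course) and exists only to give LDA something to fit:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus (hypothetical, just to give LDA something to fit)
docs <- data.frame(
    doc_id = c("d1", "d2", "d3", "d4"),
    text = c("grain harvest wheat farm",
             "ship cargo port harbor",
             "farm wheat grain field",
             "harbor ship port cargo"),
    stringsAsFactors = FALSE
)

# Build a document-term matrix
dtm <- docs %>%
    unnest_tokens(input=text, output=word) %>%
    count(doc_id, word) %>%
    cast_dtm(document=doc_id, term=word, value=n)

# Fit a model for each candidate k and record its perplexity
ks <- 2:4
perplexities <- sapply(ks, function(k) {
    mod <- LDA(x=dtm, k=k, method="Gibbs",
               control=list(alpha=1, seed=12345, iter=200, thin=1))
    perplexity(object=mod, newdata=dtm)
})

# Look for an elbow where extra topics stop reducing perplexity
plot(ks, perplexities, xlab="number of topics, k", ylab="perplexity score", type="o")
```

With a real corpus the scan is slow, which is why the course suggests resuming a fitted model via the model argument instead of refitting from scratch.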

Topic Models Fitted to Novels:

  • A topic model can be used to analyze one long work - for example, a novel such as “Moby Dick”
    • Documents are chunks long enough to capture an event or a scene in the plot
    • For traditional novels - 1000+ words (roughly 3 pages of text)
  • Text chunks can be considered as chapters
    • corpus %>% unnest_tokens(input=text, output=word) %>% count(chapter, word)
    • With text chunks, we need to generate the “chapter number” on our own
    • Candidate function: %/% - integer division
    • corpus %>% unnest_tokens(input=text, output=word) %>% mutate(word_index = 1:n()) %>% mutate(doc_number = word_index %/% 1000 + 1) %>% count(doc_number, word) %>% cast_dtm(term=word, document=doc_number, value=n)
  • Picking a good chunk size is more an art than a science
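
The integer-division trick above can be checked in base R; the 2500-word index below is illustrative:

```r
# A running word index, as produced by mutate(word_index = 1:n())
word_index <- 1:2500

# Integer division assigns 1000-word chunks to document numbers.
# Because indexing starts at 1, doc 1 gets words 1-999, doc 2 gets
# words 1000-1999, and doc 3 gets the remaining words 2000-2500.
doc_number <- word_index %/% 1000 + 1
table(doc_number)
```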

Locking Topics With Seed Words:

  • Passing seeds for random numbers helps to ensure reproducibility
    • LDA performs randomized search through the space of parameters - Gibbs sampling
    • Topic numbering is unstable (topic 1 need not be the same content area when re-run)
  • Can initialize the Gibbs sampler with seed words, which helps to lock in some of the topic numbers
    • Lock topic numbers
    • Specify weights for seed words for topics
    • seedwords requires a matrix, k rows, N columns
    • k is number of topics, N is vocabulary size
    • Weights get normalized internally so they sum up to 1 (need not be passed already summing to one)
  • Example for a tiny dataset with dtm of 5x34 and k=2
    • Matrix for seed words needs to be 2x34 - the number of rows is k and the number of columns matches the dtm
    • seedwords = matrix(nrow=2, ncol=34, data=0)
    • colnames(seedwords) = colnames(dtm)
    • seedwords[1, "restaurant"] = 1
    • seedwords[2, "loans"] = 1
    • lda_mod = LDA(x=dtm, k=2, method="Gibbs", control=list(alpha=1, seed=1234))
    • tidy(lda_mod, "beta") %>% spread(key=topic, value=beta) %>% filter(term %in% c("restaurant", "loans"))
    • lda_mod = LDA(x=dtm, k=2, method="Gibbs", seedwords=seedwords, control=list(alpha=1, seed=1234))
    • tidy(lda_mod, "beta") %>% spread(key=topic, value=beta) %>% filter(term %in% c("restaurant", "loans"))
  • Seed words are useful for two reasons
    • Convenient for pre-trained models
    • Training a model involves multiple runs of the algorithm, even for the same k
    • Seedwords let us “lock” topic numbers
    • Helpful input for training models
    • Speed up algorithm convergence by providing a starting point
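
Putting the seed-word steps together, a minimal runnable sketch (the two-theme corpus is hypothetical, standing in for the course's 5x34 example):

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(topicmodels)

# Hypothetical two-theme corpus: restaurant reviews vs. loan documents
docs <- data.frame(
    doc_id = c("d1", "d2", "d3", "d4"),
    text = c("restaurant food menu dinner",
             "loans bank credit interest",
             "restaurant dinner menu chef",
             "loans interest bank payment"),
    stringsAsFactors = FALSE
)

dtm <- docs %>%
    unnest_tokens(input=text, output=word) %>%
    count(doc_id, word) %>%
    cast_dtm(document=doc_id, term=word, value=n)

# Seed matrix: k rows, one column per vocabulary term, zero everywhere
# except the words used to anchor each topic number
seedwords <- matrix(nrow=2, ncol=ncol(dtm), data=0)
colnames(seedwords) <- colnames(dtm)
seedwords[1, "restaurant"] <- 1
seedwords[2, "loans"] <- 1

mod <- LDA(x=dtm, k=2, method="Gibbs", seedwords=seedwords,
           control=list(alpha=1, seed=1234, iter=200))

# "restaurant" should now lean toward topic 1 and "loans" toward topic 2
broom::tidy(mod, matrix="beta") %>%
    spread(key=topic, value=beta) %>%
    filter(term %in% c("restaurant", "loans"))
```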

Wrap Up:

  • LDA topic modeling returns a probability (soft classifier)
  • LDA requires count data as an input - can scale/transform non-integer data to counts and use successfully in LDA
  • Variational Expectation-Maximization (VEM) can also be applied to topic modeling
    • Can be applied to correlated topic models
    • Topic proportions follow a multivariate normal distribution
  • Package stm by Margaret Roberts, Brandon Stewart, Dustin Tingley, and Kenneth Benoit
    • regression modeling of topic proportions and covariates
    • automatic corpus alignment
    • held-out data as omitted words in documents
    • can use result of LDA model as a seed
  • Deep learning and word embeddings (word-to-vector) predict word meanings based on word adjacencies - Word2Vec
    • Use a deep learning neural network to predict words that occur within a window of ±n positions around a word, with n = 2 or 4
    • Transform into a vector of smaller dimensions (25, 50, 100)
    • word2vec models use very large corpora (e.g., 2 billion words)
    • do not make accommodations for multi-word entities
    • take a long time to train
  • Experiment with package wordVectors created by Ben Schmidt

Example code includes:

# DO NOT HAVE THE DATA FRAME df
# Split the Abstract column into tokens
dtm <- df %>% 
   tidytext::unnest_tokens(input=Abstract, output=word) %>% 
   # Remove stopwords
   anti_join(stop_words) %>% 
   # Count the number of occurrences
   count(AwardNumber, word) %>% 
   # Create a document term matrix
   tidytext::cast_dtm(document=AwardNumber, term=word, value=n)


dtm <- df %>% 
   tidytext::unnest_tokens(input=Abstract, output=word) %>% 
   anti_join(stop_words) %>% 
   # Count occurrences
   count(AwardNumber, word) %>%
   # Group the data
   group_by(word) %>% 
   # Filter for document wide frequency
   filter(sum(n) >= 10) %>% 
   # Ungroup the data and create a document-term matrix
   ungroup() %>% 
   tidytext::cast_dtm(document=AwardNumber, term=word, value=n)


# Create a LDA model
mod <- topicmodels::LDA(x=dtm, method="Gibbs", k=3, control=list(alpha=0.5, seed=1234, iter=500, thin=1))

# Retrieve log-likelihood
topicmodels::logLik(mod)

# Find perplexity
topicmodels::perplexity(object=mod, newdata=dtm)


# Display names of elements in the list
names(models[[1]])

# Retrieve the values of k and perplexity, and plot perplexity vs k
x <- sapply(models, '[[', "k")
y <- sapply(models, '[[', "perplexity")
plot(x, y, xlab="number of clusters, k", ylab="perplexity score", type="o")

# Record the new perplexity scores
new_perplexity_score <- numeric(length(models))

# Run each model for 100 iterations
for (i in seq_along(models)) {
    mod2 <- topicmodels::LDA(x=dtm, model=models[[i]]$model, control=list(iter=100, seed=12345, thin=1))
    new_perplexity_score[i] <- topicmodels::perplexity(object=mod2, newdata=dtm)
}

# Specify the possible values of k and build the plot
k <- 2:10
plot(x=k, y=new_perplexity_score, xlab="number of clusters, k", ylab="perplexity score", type="o")


t <- history %>% 
    # Unnest the tokens
    tidytext::unnest_tokens(input=text, output=word) %>% 
    # Create a word index column
    mutate(word_index = row_number()) %>% 
    # Create a document number column
    mutate(document_number = word_index %/% 1000 + 1)


dtm <- t %>% 
    # Join verbs on "word" and "past"
    inner_join(verbs, by=c("word"="past")) %>% 
    # Count word
    count(document_number, word) %>%
    # Create a document-term matrix
    tidytext::cast_dtm(document=document_number, term=word, value=n)


# Store the names of documents in a vector
required_documents <- c(" Africa", " Emperor Heraclius", " Adrianople", " Daniel", " African")

# Convert table into wide format
broom::tidy(mod, matrix="gamma") %>% 
    spread(key=topic, value=gamma) %>% 
    # Keep only the rows with document names matching the required documents
    filter(document %in% required_documents)


# Set up the column names
colnames(seedwords) <- colnames(dtm)

# Set the weights
seedwords[1, "defeated_l2"] = 1
seedwords[2, "across_l2"] = 1

# Fit the topic model
mod <- topicmodels::LDA(dtm, k=3, method="Gibbs", seedwords=seedwords, 
                        control=list(alpha=1, iter=500, seed=1234)
                        )

# Examine topic assignment in the fitted model
broom::tidy(mod, "gamma") %>% spread(topic, gamma) %>% 
    filter(document %in% c(" Daniel", " Adrianople", " African"))

Intermediate Interactive Data Visualization with plotly in R

Chapter 1 - Introduction and Review of plotly

Interactive and Dynamic Graphics:

  • Interactive graphics can provide insights over and above standard plots
  • The plotly package supports interactive and dynamic plots and is under active development
  • Should consider the report’s intended usage - static, interactive, dynamic
    • Interactive - changes in response to user input, such as zoom in, see the label when hovering over a point, etc.
    • Dynamic - change automatically (e.g., animation)
  • Example for running plotly
    • msci %>% plot_ly(x=~Date, y=~Close) %>% add_lines()
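
A runnable stand-in for the msci example, using the built-in economics dataset from ggplot2 (an assumption; the msci data is not shown here):

```r
library(plotly)
library(ggplot2)

# economics (built into ggplot2) stands in for the msci data
p <- economics %>%
    plot_ly(x = ~date, y = ~unemploy) %>%
    add_lines()

# p is an htmlwidget; printing it renders the interactive line chart
# with hover, zoom, and pan in a viewer or browser
```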

Adding Aesthetics to Represent a Variable:

  • Can add variables for attributes such as color or point size
    • add_markers(color=~income) # income is a categorical variable in this example
    • add_markers(symbol=~income) # plotly calls symbol a glyph, and will also color by symbol
    • add_markers(color=~population) # continuous metrics with continuous scale
    • add_markers(color=~log10(population)) # continuous metrics
    • add_markers(size=~population) # continuous metrics
  • Can further polish the labels and add axis labels
    • happy %>% plot_ly(x=~x, y=~y, hoverinfo="text", text=~paste("My A:", A, "<br>My B:", B)) %>% add_markers(size=~A) %>% layout(xaxis=list(title="My X Label"), yaxis=list(title="My Y Label"))
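
A sketch of these aesthetic mappings on mtcars (a stand-in dataset, not from the course):

```r
library(plotly)

# mtcars stands in for the course data; cyl becomes a categorical variable
d <- mtcars
d$cyl <- factor(d$cyl)

p <- d %>%
    plot_ly(x = ~wt, y = ~mpg, hoverinfo = "text",
            text = ~paste("Weight:", wt, "<br>MPG:", mpg)) %>%
    add_markers(color = ~cyl, size = ~hp) %>%  # color by category, size by hp
    layout(xaxis = list(title = "Weight (1000 lbs)"),
           yaxis = list(title = "Miles per gallon"))

# p is an htmlwidget; printing it opens the interactive plot
```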

Moving Beyond Simple Interactivity:

  • Can extend on interaction to include animations and linked brushing
  • Example for creating a static bubble chart
    • myData %>% plot_ly(x=~myX, y=~myY, hoverinfo="text", text=~myLabel) %>% add_markers(size=~mySize, color=~myColor, marker=list(opacity=0.5, sizemode="diameter", sizeref=2))
  • Can create linked brushing where clicking a bar chart can also update the points highlighted in a scatter plot

Example code includes:

# load the plotly package
library(plotly)


acwiDate <- as.Date(c('2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12', '2017-01-13', '2017-01-17', '2017-01-18', '2017-01-19', '2017-01-20', '2017-01-23', '2017-01-24', '2017-01-25', '2017-01-26', '2017-01-27', '2017-01-30', '2017-01-31', '2017-02-01', '2017-02-02', '2017-02-03', '2017-02-06', '2017-02-07', '2017-02-08', '2017-02-09', '2017-02-10', '2017-02-13', '2017-02-14', '2017-02-15', '2017-02-16', '2017-02-17', '2017-02-21', '2017-02-22', '2017-02-23', '2017-02-24', '2017-02-27', '2017-02-28', '2017-03-01', '2017-03-02', '2017-03-03', '2017-03-06', '2017-03-07', '2017-03-08', '2017-03-09', '2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15', '2017-03-16', '2017-03-17', '2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23', '2017-03-24', '2017-03-27', '2017-03-28', '2017-03-29', '2017-03-30', '2017-03-31', '2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-10', '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20', '2017-04-21', '2017-04-24', '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28', '2017-05-01', '2017-05-02', '2017-05-03', '2017-05-04', '2017-05-05', '2017-05-08', '2017-05-09', '2017-05-10', '2017-05-11', '2017-05-12', '2017-05-15', '2017-05-16', '2017-05-17', '2017-05-18', '2017-05-19', '2017-05-22', '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-30', '2017-05-31', '2017-06-01', '2017-06-02', '2017-06-05', '2017-06-06', '2017-06-07', '2017-06-08', '2017-06-09', '2017-06-12', '2017-06-13', '2017-06-14', '2017-06-15', '2017-06-16', '2017-06-19', '2017-06-20', '2017-06-21', '2017-06-22', '2017-06-23', '2017-06-26', '2017-06-27', '2017-06-28', '2017-06-29', '2017-06-30', '2017-07-03', '2017-07-05', '2017-07-06', '2017-07-07', '2017-07-10', '2017-07-11', '2017-07-12', '2017-07-13', '2017-07-14', '2017-07-17', '2017-07-18', '2017-07-19', '2017-07-20', '2017-07-21', '2017-07-24', '2017-07-25', 
'2017-07-26', '2017-07-27', '2017-07-28', '2017-07-31', '2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04', '2017-08-07', '2017-08-08', '2017-08-09', '2017-08-10', '2017-08-11', '2017-08-14', '2017-08-15', '2017-08-16', '2017-08-17', '2017-08-18', '2017-08-21', '2017-08-22', '2017-08-23', '2017-08-24', '2017-08-25', '2017-08-28', '2017-08-29', '2017-08-30', '2017-08-31', '2017-09-01', '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-08', '2017-09-11', '2017-09-12', '2017-09-13', '2017-09-14', '2017-09-15', '2017-09-18', '2017-09-19', '2017-09-20', '2017-09-21', '2017-09-22', '2017-09-25', '2017-09-26', '2017-09-27', '2017-09-28', '2017-09-29', '2017-10-02', '2017-10-03', '2017-10-04', '2017-10-05', '2017-10-06', '2017-10-09', '2017-10-10', '2017-10-11', '2017-10-12', '2017-10-13', '2017-10-16', '2017-10-17', '2017-10-18', '2017-10-19', '2017-10-20', '2017-10-23', '2017-10-24', '2017-10-25', '2017-10-26', '2017-10-27', '2017-10-30', '2017-10-31', '2017-11-01', '2017-11-02', '2017-11-03', '2017-11-06', '2017-11-07', '2017-11-08', '2017-11-09', '2017-11-10', '2017-11-13', '2017-11-14', '2017-11-15', '2017-11-16', '2017-11-17', '2017-11-20', '2017-11-21', '2017-11-22', '2017-11-24', '2017-11-27', '2017-11-28', '2017-11-29', '2017-11-30', '2017-12-01', '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08', '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14', '2017-12-15', '2017-12-18', '2017-12-19', '2017-12-20', '2017-12-21', '2017-12-22', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29'))
acwiOpen <- c(59.61, 59.87, 60.15, 60.35, 60.22, 60.24, 60.25, 60.34, 60.54, 60.4, 60.47, 60.43, 60.43, 60.45, 60.49, 61.08, 61.41, 61.31, 60.89, 60.85, 61.13, 60.85, 61.26, 61.08, 61.12, 61.01, 61.3, 61.58, 61.95, 62, 62.08, 62.49, 62.19, 62.58, 62.61, 62.93, 62.32, 62.48, 62.47, 62.83, 63, 62.82, 62.75, 62.64, 62.63, 62.44, 62.78, 62.93, 62.74, 62.91, 63.65, 63.62, 63.58, 63.79, 62.83, 62.94, 63.09, 62.69, 63.07, 63.15, 63.42, 63.24, 63.31, 63.09, 63.43, 63.14, 63.07, 63.1, 63.13, 63.11, 62.88, 62.83, 62.75, 62.95, 62.96, 63.07, 63.94, 64.23, 64.34, 64.39, 64.34, 64.44, 64.6, 64.47, 64.52, 64.7, 64.95, 64.91, 64.84, 64.92, 64.95, 65.23, 65.58, 65.06, 64.41, 64.91, 65.36, 65.53, 65.48, 65.78, 65.78, 65.66, 65.91, 65.87, 66.44, 66.54, 66.3, 66.36, 66.4, 66.42, 66.06, 66.38, 66.71, 65.72, 66.16, 66.5, 65.86, 65.47, 65.46, 65.5, 65.93, 65.73, 65.62, 65.86, 65.6, 65.67, 65.54, 65.32, 65.09, 65.36, 65.48, 65.93, 66.24, 66.4, 66.74, 66.62, 66.88, 67.25, 67.02, 67.05, 67.22, 67.34, 67.59, 67.03, 67.35, 67.54, 67.79, 67.46, 67.56, 67.52, 67.53, 67.11, 67.03, 66.34, 66.8, 66.94, 67.14, 66.95, 66.27, 66.32, 66.63, 66.77, 67, 67.01, 67.11, 66.6, 66.97, 67.3, 67.73, 67.48, 67.4, 67.76, 67.72, 68.04, 68.46, 68.45, 68.3, 68.5, 68.74, 68.81, 68.88, 68.82, 68.69, 68.51, 68.37, 68.33, 68.26, 68.59, 68.77, 69, 69.17, 69.25, 69.21, 69.45, 69.57, 69.7, 69.73, 70, 70.03, 69.97, 70.14, 69.82, 70.19, 70.25, 70.02, 70.04, 69.87, 69.93, 70.06, 70.21, 70.63, 70.41, 70.48, 70.57, 70.73, 70.71, 70.39, 70.37, 70.02, 70.14, 69.7, 70.21, 70.33, 70.48, 70.89, 71.15, 71.35, 71.31, 71.32, 71.66, 71.67, 71.61, 71.88, 71.88, 71.01, 70.94, 71.58, 71.66, 71.79, 72.04, 72.07, 71.88, 72.54, 72.03, 71.98, 71.83, 71.97, 72.01, 72.04, 72.3, 72.39)
acwiVolume <- 1000 * c(2576.7, 1087.3, 1717.3, 1233.8, 1471.1, 1393.6, 1508.8, 1481.4, 2432.3, 2090, 2246.2, 2137.9, 1611.1, 1991.5, 1641.1, 2840.2, 2160.1, 1003.4, 4898.2, 2894.4, 3562.9, 1657.2, 1997.7, 1144.1, 1062.4, 1950.6, 1028, 1650, 2759.1, 2269.6, 2227.5, 2579.4, 2296.1, 2893.7, 2520.1, 3967.3, 1936.7, 1179.7, 1733, 2769.6, 1960.8, 1665.1, 930.8, 1061.5, 1612.4, 1679.7, 2343.2, 1273.7, 1780.8, 3030.7, 2072.3, 1595.4, 1362.9, 2535.6, 1795.2, 3562.7, 1503.5, 1775.4, 4208, 2474.4, 1574.5, 2063.4, 2034.7, 1977.5, 2895.8, 1652.6, 952.9, 1817.8, 1235.2, 1185.6, 1809.8, 2123.6, 1750.8, 976.5, 1681.4, 2036.3, 1944.4, 2231.6, 1360.9, 1897.3, 1511.7, 1268.5, 3350.2, 938.8, 703.2, 1007.2, 869.8, 1010.1, 932.1, 2738.8, 967.7, 1258.5, 676.4, 1467.8, 1974.6, 1207, 1498.8, 1682.5, 962.7, 1089.5, 1248.6, 1226.7, 1130.4, 3266.9, 3152.7, 1609.4, 1270.5, 2328.8, 1655.5, 1555.6, 3059.1, 850.7, 1955.1, 1871, 1534.7, 1774.6, 2155.1, 1116.5, 2049.5, 1096.9, 3644.7, 1694.3, 1859.1, 1690.4, 3787.3, 1465.5, 3957.6, 2196.1, 2179.2, 761.9, 1039.2, 1509.3, 1474.7, 774.2, 1197.8, 2261.4, 1640.3, 1669.5, 881, 1580.2, 1884.3, 4450.3, 1594.1, 1183.3, 2577.7, 2403.4, 1061.5, 623.1, 1141.4, 917.6, 859.4, 606.6, 1778.9, 1512.2, 1033, 871.8, 1123.2, 1594, 1434.6, 1029.3, 1013.7, 864.3, 4048.5, 989.2, 748.9, 1259.3, 781.2, 1045.1, 1675.7, 4411.3, 1557.2, 1050.2, 610.4, 961.4, 1144, 1079.9, 358.2, 969.8, 1256.6, 1515.8, 2005, 1264.2, 951.8, 4745.4, 1015.3, 801, 3610.4, 2570.3, 3559.1, 1349.5, 1151, 846.5, 1034.1, 625.7, 1088.8, 1111.9, 1388.6, 650.2, 2709.7, 1315.7, 2027.4, 1260.3, 1314.8, 876.1, 677.4, 920, 1039.3, 960.1, 771.1, 1247.4, 1493.2, 1420.1, 1157.6, 764.1, 1731.1, 536.9, 1439.9, 551.3, 789.8, 1108.1, 1198.4, 998.6, 591.1, 889.4, 872.3, 1325.8, 700.6, 1098.1, 1015.9, 1623.8, 7719.2, 6765.1, 2408.8, 2596.6, 1169.7, 1864.7, 1167.7, 5803.3, 1731, 1855.7, 2458.4, 1700.7, 2077.2, 1458.3, 2557.6, 2240.2, 1107.5, 4644.2, 3342.1, 1402.7, 1104.4)
acwi <- tibble::tibble(Date=acwiDate, Open=acwiOpen, Volume=acwiVolume)
str(acwi)


# Create a times series plot of Open against Date
acwi %>% 
    plot_ly(x = ~Date, y = ~Open) %>% 
    add_lines()


# Create a scatterplot with Date on the x-axis 
# and Volume on the y-axis
acwi %>% 
    plot_ly(x = ~Date, y = ~Volume) %>%
    add_markers()


happyLE <- c(52.3, 69.1, 65.7, 67.5, 65.1, 72.8, 72.4, 63.1, 66.1, 62.5, 66.6, 72.1, 51.8, 60.3, 68, 58, 65.5, 66.4, 52.4, 58.6, 50.1, 44.6, 46, 69.6, 69.3, 64.1, 55.3, 50.8, 69.9, 67, 72.8, 71.5, 71.7, 63.5, 67.3, 61.6, 64.3, 67.3, 56.7, 71.7, 72.6, 57.1, 64.3, 71.1, 54.9, 71.8, 63.4, 51.2, 53.3, 63.8, 76.5, 67.2, 72.8, 59.5, 60.6, 66, 61, 71.7, 71.9, 74.1, 47, 65.8, 75.3, 64.4, 64.2, 58.7, 62.4, 65.4, 62.9, 57.9, 65.1, 68.9, 52.8, 61.5, 67, 72.2, 65.9, 57, 54.2, 49.2, 71.8, 53.3, 65.6, 68.1, 63.7, 62.3, 67.1, 65.3, 50, 57.6, 55.5, 61.3, 71.6, 71.6, 66.3, 51.3, 45.9, 71.1, 57.5, 63.1, 68, 65.4, 60.3, 69.1, 72.2, 66.9, 63, 64.1, 58, 65.7, 44.4, 75.8, 68.8, 70.9, 55.2, 74, 50, 74.5, 65.3, 73, 73.2, 71.2, 63, 56.7, 66.4, 52.3, 61.7, 65.9, 65.8, 60.4, 51.8, 63.2, 68.6, 72.1, 69.8, 68.4, 63.2, 66.2, 55, 53.8, 52.7)
happyHappy <- c(2.66, 4.64, 5.25, 6.04, 4.29, 7.26, 7.29, 5.15, 6.23, 4.31, 5.55, 6.93, 4.85, 5.65, 5.09, 3.5, 6.33, 5.1, 4.65, 4.59, 5.07, 3.48, 4.56, 6.32, 5.1, 6.16, 4.88, 4.31, 7.23, 5.34, 6.06, 6.79, 7.59, 5.61, 5.84, 3.93, 6.34, 5.94, 4.18, 7.79, 6.64, 4.78, 4.45, 7.07, 5.48, 5.15, 6.33, 4.87, 3.82, 6.02, 5.36, 6.07, 7.48, 4.05, 5.1, 4.72, 4.46, 7.06, 7.33, 6.2, 5.04, 5.89, 5.91, 4.81, 5.88, 4.48, 6.15, 6.09, 5.63, 4.62, 5.98, 5.15, 4.42, 5.65, 6.27, 7.06, 5.23, 4.08, 3.42, 4.74, 6.68, 4.68, 6.17, 6.41, 5.33, 5.33, 5.61, 5.31, 4.28, 4.15, 4.44, 4.74, 7.46, 7.33, 6.48, 4.62, 5.32, 7.58, 5.83, 4.63, 6.57, 5.71, 5.59, 6.2, 5.71, 6.09, 5.58, 6.29, 4.68, 5.12, 4.09, 6.38, 6.37, 6.17, 4.51, 5.87, 2.82, 6.23, 4.33, 7.29, 7.47, 6.36, 5.83, 3.35, 5.94, 4.36, 6.19, 4.12, 5.61, 5.23, 4, 4.31, 7.04, 7.1, 6.99, 6.34, 6.42, 5.18, 3.25, 3.93, 3.64)
happyGDP <- c(7.46, 9.37, 9.54, 9.84, 9.03, 10.71, 10.72, 9.65, 10.69, 8.16, 9.72, 10.66, 7.63, 8.83, 9.37, 9.68, 9.55, 9.82, 7.43, 8.2, 8.13, NA, 7.49, 10.04, 9.64, 9.49, 8.56, 6.63, 9.67, 10.01, 10.36, 10.39, 10.75, 9.59, 9.22, 9.26, 9, 10.29, 7.44, 10.61, 10.56, 9.72, 9.17, 10.71, 8.33, 10.12, 8.92, 7.53, 7.4, 8.4, 10.92, 10.19, 10.76, 8.77, 9.32, 9.85, 9.62, 11.07, 10.41, 10.47, 8.18, 9.03, 10.57, 9.03, 10.07, 8.01, NA, 11.11, 8.11, 8.71, 10.13, 9.49, 6.63, NA, 10.28, 11.47, 9.5, 7.25, 7, 7.61, 10.5, 8.19, 9.91, 9.74, 8.55, 9.32, 9.69, 8.92, 7.05, 8.63, 9.2, 7.8, 10.79, 10.48, 8.58, 6.83, 8.59, 11.08, 8.52, NA, 10, 9.41, 8.94, 10.21, 10.24, 10.03, 10.1, 10.81, 7.81, 9.56, 7.25, 11.32, 10.31, 10.35, 9.41, 10.49, NA, 10.45, 9.38, 10.77, 10.96, NA, 7.96, 7.9, 9.69, 7.25, 10.32, 9.29, 10.12, 9.7, 7.44, 8.97, 11.12, 10.58, 10.9, 9.92, 8.76, 8.74, NA, 8.21, 7.54)
happyRegion <- c('South Asia', 'Central and Eastern Europe', 'Middle East and North Africa', 'Latin America and Caribbean', 'Commonwealth of Independent States', 'North America and ANZ', 'Western Europe', 'Commonwealth of Independent States', 'Middle East and North Africa', 'South Asia', 'Commonwealth of Independent States', 'Western Europe', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Central and Eastern Europe', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Central and Eastern Europe', 'Sub-Saharan Africa', 'Southeast Asia', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'East Asia', 'Latin America and Caribbean', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Central and Eastern Europe', 'Western Europe', 'Central and Eastern Europe', 'Western Europe', 'Latin America and Caribbean', 'Latin America and Caribbean', 'Middle East and North Africa', 'Latin America and Caribbean', 'Central and Eastern Europe', 'Sub-Saharan Africa', 'Western Europe', 'Western Europe', 'Sub-Saharan Africa', 'Commonwealth of Independent States', 'Western Europe', 'Sub-Saharan Africa', 'Western Europe', 'Latin America and Caribbean', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Latin America and Caribbean', 'East Asia', 'Central and Eastern Europe', 'Western Europe', 'South Asia', 'Southeast Asia', 'Middle East and North Africa', 'Middle East and North Africa', 'Western Europe', 'Middle East and North Africa', 'Western Europe', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'East Asia', 'Middle East and North Africa', 'Commonwealth of Independent States', 'Sub-Saharan Africa', 'Central and Eastern Europe', 'Middle East and North Africa', 'Commonwealth of Independent States', 'Southeast Asia', 'Central and Eastern Europe', 'Middle East and North Africa', 'Sub-Saharan Africa', 'Middle East and North Africa', 'Central and Eastern Europe', 'Western Europe', 'Central and Eastern 
Europe', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Western Europe', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Commonwealth of Independent States', 'East Asia', 'Central and Eastern Europe', 'Middle East and North Africa', 'Sub-Saharan Africa', 'Southeast Asia', 'Sub-Saharan Africa', 'South Asia', 'Western Europe', 'North America and ANZ', 'Latin America and Caribbean', 'Sub-Saharan Africa', 'Sub-Saharan Africa', 'Western Europe', 'South Asia', 'Middle East and North Africa', 'Latin America and Caribbean', 'Latin America and Caribbean', 'Southeast Asia', 'Central and Eastern Europe', 'Western Europe', 'Central and Eastern Europe', 'Commonwealth of Independent States', 'Middle East and North Africa', 'Sub-Saharan Africa', 'Central and Eastern Europe', 'Sub-Saharan Africa', 'Southeast Asia', 'Central and Eastern Europe', 'Central and Eastern Europe', 'Sub-Saharan Africa', 'East Asia', 'Sub-Saharan Africa', 'Western Europe', 'South Asia', 'Western Europe', 'Western Europe', 'East Asia', 'Commonwealth of Independent States', 'Sub-Saharan Africa', 'Southeast Asia', 'Sub-Saharan Africa', 'Latin America and Caribbean', 'Middle East and North Africa', 'Middle East and North Africa', 'Commonwealth of Independent States', 'Sub-Saharan Africa', 'Commonwealth of Independent States', 'Middle East and North Africa', 'Western Europe', 'North America and ANZ', 'Latin America and Caribbean', 'Commonwealth of Independent States', 'Southeast Asia', 'Middle East and North Africa', 'Sub-Saharan Africa', 'Sub-Saharan Africa')
happyIncome <- factor(c('low', 'upper-middle', 'upper-middle', 'high', 'upper-middle', 'high', 'high', 'upper-middle', 'high', 'lower-middle', 'upper-middle', 'high', 'low', 'lower-middle', 'upper-middle', 'upper-middle', 'upper-middle', 'upper-middle', 'low', 'lower-middle', 'lower-middle', 'low', 'low', 'high', 'upper-middle', 'upper-middle', 'NA', 'NA', 'upper-middle', 'high', 'high', 'high', 'high', 'upper-middle', 'upper-middle', 'NA', 'lower-middle', 'high', 'low', 'high', 'high', 'upper-middle', 'lower-middle', 'high', 'lower-middle', 'high', 'upper-middle', 'low', 'low', 'lower-middle', 'NA', 'high', 'high', 'lower-middle', 'lower-middle', 'NA', 'upper-middle', 'high', 'high', 'high', 'NA', 'upper-middle', 'high', 'upper-middle', 'upper-middle', 'lower-middle', 'lower-middle', 'high', 'NA', 'NA', 'high', 'upper-middle', 'low', 'upper-middle', 'high', 'high', 'NA', 'low', 'low', 'low', 'high', 'lower-middle', 'upper-middle', 'upper-middle', 'lower-middle', 'lower-middle', 'upper-middle', 'lower-middle', 'low', 'lower-middle', 'upper-middle', 'low', 'high', 'high', 'lower-middle', 'low', 'lower-middle', 'high', 'lower-middle', 'NA', 'high', 'upper-middle', 'lower-middle', 'high', 'high', 'upper-middle', 'NA', 'high', 'low', 'upper-middle', 'low', 'high', 'NA', 'high', 'upper-middle', 'NA', 'low', 'high', 'lower-middle', 'high', 'high', 'NA', 'low', 'low', 'upper-middle', 'low', 'high', 'lower-middle', 'upper-middle', 'upper-middle', 'low', 'lower-middle', 'high', 'high', 'high', 'high', 'lower-middle', 'lower-middle', 'NA', 'lower-middle', 'low'), levels=c("low", "lower-middle", "upper-middle", "high"))
happySS <- c(0.491, 0.638, 0.807, 0.907, 0.698, 0.95, 0.906, 0.787, 0.876, 0.713, 0.9, 0.922, 0.436, 0.779, 0.775, 0.768, 0.905, 0.942, 0.785, 0.765, 0.695, 0.32, 0.661, 0.88, 0.772, 0.909, 0.655, 0.67, 0.922, 0.77, 0.819, 0.901, 0.952, 0.894, 0.849, 0.638, 0.829, 0.936, 0.734, 0.964, 0.931, 0.807, 0.59, 0.892, 0.669, 0.753, 0.826, 0.634, 0.647, 0.843, 0.831, 0.877, 0.967, 0.607, 0.796, 0.714, 0.695, 0.943, 0.916, 0.92, 0.661, 0.913, 0.882, 0.815, 0.914, 0.715, 0.792, 0.853, 0.883, 0.707, 0.895, 0.777, 0.685, 0.823, 0.926, 0.905, 0.8, 0.626, 0.555, 0.741, 0.937, 0.779, 0.91, 0.8, 0.831, 0.924, 0.881, 0.641, 0.678, 0.795, 0.828, 0.816, 0.937, 0.955, 0.838, 0.582, 0.733, 0.95, 0.69, 0.824, 0.912, 0.83, 0.851, 0.882, 0.9, 0.811, 0.896, 0.84, 0.744, 0.884, 0.652, 0.897, 0.913, 0.928, 0.87, 0.807, 0.557, 0.903, 0.823, 0.914, 0.95, 0.891, 0.663, 0.705, 0.877, 0.508, 0.916, 0.717, 0.876, 0.908, 0.74, 0.858, 0.836, 0.937, 0.921, 0.914, 0.942, NA, 0.79, 0.744, 0.754)
happyCountry <- c('Afghanistan', 'Albania', 'Algeria', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Cambodia', 'Cameroon', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Haiti', 'Honduras', 'Hong Kong S.A.R. of China', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Liberia', 'Libya', 'Lithuania', 'Luxembourg', 'Macedonia', 'Madagascar', 'Malawi', 'Mali', 'Malta', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Pakistan', 'Palestinian Territories', 'Panama', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Romania', 'Russia', 'Saudi Arabia', 'Senegal', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland', 'Taiwan Province of China', 'Tajikistan', 'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe')
happy <- tibble::tibble(region=happyRegion, life.expectancy=happyLE, social.support=happySS, 
                        happiness=happyHappy, log.gdp=happyGDP, income=happyIncome, 
                        country=happyCountry
                        )
str(happy)


# Create a coded scatterplot of happiness vs. life.expectancy
happy %>%
    plot_ly(x=~life.expectancy, y=~happiness) %>%
    add_markers(color=~region, size=~log.gdp)


# Fill in the specified plotting symbols
happy %>%
    plot_ly(x = ~life.expectancy, y = ~happiness) %>%
    add_markers(symbol = ~income, symbols = c("circle-open", "square-open", "star-open", "x-thin-open"))


# Complete the following code to polish the plot
happy %>%
    plot_ly(x = ~social.support, y = ~happiness, hoverinfo = "text",
            text = ~paste("Country: ", country, "<br> Income: ", income,
                          "<br> Happiness: ", round(happiness, 2),
                          "<br> Social support: ", round(social.support, 2)
                          )
            ) %>%
    add_markers(symbol = ~income, symbols = c("circle-open", "square-open", "star-open", "x-thin-open")) %>%
    layout(xaxis = list(title="Social support index"), yaxis = list(title="National happiness score"))


us_economy <- readr::read_csv("./RInputFiles/state_economic_data.csv")
str(us_economy, give.attr=FALSE)
launches <- readr::read_csv("./RInputFiles/launches.csv")
str(launches, give.attr=FALSE)


# Change the sizemode so that size refers to the diameter of the points
us_economy %>%
    filter(year == 2017) %>%
    plot_ly(x = ~gdp, y = ~house_price) %>%
    add_markers(size = ~population, color = ~region, marker = list(sizemode="diameter"))


# Create a line chart of house_price over time by state
us_economy %>%
    filter(year >= 2000) %>%
    group_by(state) %>%
    plot_ly(x = ~year, y = ~house_price) %>%
    add_lines()

Chapter 2 - Animating Graphics

Introduction to Animated Graphics:

  • Animated graphics can help show changes in data over time, especially when there are a large number of time periods such that faceting would be difficult/impossible
  • Example for the base aesthetic
    • myDF %>% plot_ly(x=~myX, y=~myY) %>% add_markers(frame=~myAnimateVar, showlegend=FALSE) %>% layout(xaxis=list(type="log"), yaxis=list(type="log")) # the frame= variable controls the animation
  • May want to keep the glyphs constant (not disappearing/reappearing when there is an NA before/after them) using the ids variable
    • myDF %>% plot_ly(x=~myX, y=~myY) %>% add_markers(frame=~myAnimateVar, ids=~myIDVar, showlegend=FALSE) %>% layout(xaxis=list(type="log"), yaxis=list(type="log")) # ids= ties each glyph to a stable identity across frames

Polishing Animations:

  • Can polish aspects of the animation such as time per change or slider appearance
  • Can use the animation_opts() function to change the animation
    • ani %>% animation_opts(frame=500, transition=300, easing="linear", redraw=TRUE) # assumes ani is a properly built plot_ly animation; frame=500 means 500ms per frame (lower means faster transitions)
    • Note that there is a pause of frame - transition milliseconds between each transition
    • Can set easing="bounce" (or other easing styles, such as "elastic") to change how glyphs move to the next frame
  • Can change the appearance of the slider
    • ani %>% animation_slider(hide=TRUE) # remove the slider
    • ani %>% animation_slider(currentvalue=list(prefix=NULL, font=list(color="black", size=40)))

Layering in plotly:

  • Can layer traces, similar to the process of layering in ggplot2
  • Example of producing a text layer first (large year), then layering the rest of the plot on top
    • myDF %>% plot_ly(x=~myX, y=~myY) %>% add_text(x=xText, y=yText, text=~year, frame=~year, textfont=list(size=150, color=toRGB("gray80"))) %>% add_markers(frame=~year) %>% layout(showlegend=FALSE) %>% animation_slider(hide=TRUE)

Cumulative Animations:

  • A cumulative animation makes a time series appear to grow over time
    • Each frame must contain a path with all the data up through that year (not just the current year)
  • Can create a cumulative data frame
    • Basically, a frame column is created with the key year as its ID, while the retained year column holds all the data up through that year
  • Base R's split() and purrr's accumulate() can build the cumulative frames
    • belgium %>% split(.$year) %>% accumulate(~bind_rows(.x, .y)) %>% set_names(1960:2018) %>% bind_rows(.id="frame") %>% plot_ly(x=~year, y=~income) %>% add_lines(frame=~frame, showlegend=FALSE)
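
The split()/accumulate()/bind_rows() pattern above can be sketched on a toy data set (the tibble and its values here are made up for illustration):

```r
library(dplyr)
library(purrr)

# Toy sketch of the cumulative-frame idea: each frame holds all rows
# up through its year
toy <- tibble(year = 2000:2002, income = c(10, 12, 15))

cumulative <- toy %>%
    split(.$year) %>%                     # one data frame per year
    accumulate(~bind_rows(.x, .y)) %>%    # frame i = rows for years 1..i
    set_names(2000:2002) %>%
    bind_rows(.id = "frame")              # frame column identifies the animation step

# The 2002 frame now contains all three years of data
nrow(filter(cumulative, frame == "2002"))  # 3
```

Plotting with frame=~frame then animates these growing subsets in order.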

Example code includes:

# Create an animated bubble chart of house_price against gdp
us_economy %>%
    plot_ly(x = ~gdp, y = ~house_price) %>%
    add_markers(size = ~population, color = ~region, frame = ~year, ids = ~state, 
                marker = list(sizemode = "diameter")
                )


# Animate a bubble chart of house_price against gdp over region
ani <- us_economy %>%
    filter(year==2017) %>%
    plot_ly(x = ~gdp, y = ~house_price) %>%
    add_markers(size = ~population, color = ~region, 
                frame = ~region, ids = ~state, marker = list(sizemode = "diameter")
                )
ani


# Adjust the frame and transition
ani %>% 
    animation_opts(frame = 2000, transition = 300)

# Change the type of transition to "elastic"
ani %>% 
    animation_opts(frame = 2000, transition = 300, easing = "elastic")

# Remove the prefix from the slider and change the font color to "red"
ani %>% 
    animation_opts(frame = 2000, transition = 300, easing = "elastic") %>%
    animation_slider(currentvalue = list(prefix = NULL, font = list(color = "red")))

# Polish the x- and y-axis titles
ani %>% 
    animation_opts(frame = 2000, transition = 300, easing = "elastic") %>% 
    animation_slider(currentvalue = list(prefix = NULL, font = list(color = "red"))) %>%
    layout(xaxis = list(title="Real GDP (millions USD)"), yaxis = list(title="Housing price index"))


# Reduce the bubble size
us_economy %>%
    plot_ly(x = ~gdp, y = ~house_price) %>%
    add_markers(size = ~population, color = ~region, frame = ~year, ids = ~state,
                marker = list(sizemode = "diameter", sizeref=3)
                ) %>%
    layout(xaxis = list(title = "Real GDP (millions USD)", type = "log"), 
           yaxis = list(title = "Housing price index")
           )

# Map state names to the hover info text
us_economy %>%
    plot_ly(x = ~gdp, y = ~house_price, hoverinfo = "text", text = ~state) %>%
    add_markers(size = ~population, color = ~region, frame = ~year, ids = ~state,
                marker = list(sizemode = "diameter", sizeref = 3)
                ) %>%
    layout(xaxis = list(title = "Real GDP (millions USD)", type = "log"),
           yaxis = list(title = "Housing price index")
           )


# Add the year as background text and remove the slider
us_economy %>%
    plot_ly(x = ~gdp, y = ~house_price, hoverinfo = "text", text = ~state) %>%
    add_text(x = 200000, y = 450, text = ~year, frame = ~year,
             textfont = list(color = toRGB("gray80"), size = 150)
             ) %>%
    add_markers(size = ~population, color = ~region, frame = ~year, ids = ~state,
                marker = list(sizemode = "diameter", sizeref = 3)
                ) %>%
    layout(xaxis = list(title = "Real GDP (millions USD)", type = "log"),
           yaxis = list(title = "Housing price index")
           ) %>%
    animation_slider(hide = TRUE)


# extract the 1997 data
us1997 <- us_economy %>%
    filter(year == 1997)

# create an animated scatterplot with baseline from 1997
us_economy %>%
    plot_ly(x = ~gdp, y = ~house_price) %>%
    add_markers(data = us1997, marker = list(color = toRGB("gray60"), opacity = 0.5)) %>%
    add_markers(frame = ~year, ids = ~state, data = us_economy, showlegend = FALSE, alpha = 0.5) %>%
    layout(xaxis = list(type = "log"))


# Find median HPI for each region in each year
med_hpi <- us_economy %>%
    group_by(region, year) %>%
    summarize(median_hpi=median(house_price))


# Animate the cumulative time series of median HPI over time
med_hpi %>%
    split(.$year) %>%
    accumulate(~bind_rows(.x, .y)) %>%
    set_names(1997:2017) %>%
    bind_rows(.id = "frame") %>%
    plot_ly(x=~year, y=~median_hpi, color=~region) %>%
    add_lines(frame=~frame, showlegend=FALSE)

Chapter 3 - Linking Graphics

Linking Two Charts:

  • Linking charts allows for brushing of one chart to highlight other charts
  • A similar idea applies to linked highlighting for a chart
  • The crosstalk package allows for linking plots using JavaScript (displays in the RStudio Viewer pane)
    • p1 <-
    • p2 <-
    • subplot(p1, p2, titleX=TRUE, titleY=TRUE) %>% hide_legend() # creates separate views
  • Example for creating linked views
    • library(crosstalk)
    • shared_data <- SharedData$new(world2014)
    • p1 <-
    • p2 <-
    • subplot(p1, p2, titleX=TRUE, titleY=TRUE) %>% hide_legend() %>% highlight(on="plotly_selected") # creates linked views

Brushing Groups:

  • Can run time series in a similar manner using SharedData()
    • world_indicators %>% SharedData$new(key=~country) %>% . # this will highlight everything for the country all at the same time
  • Can use dplyr commands within the plotly pipeline AFTER the plot_ly() command
    • Cannot directly use the dplyr commands on the shared data object

Selection Strategies:

  • Transient selection (default) is the process where previous selections are forgotten when new selections are made
  • By contrast, persistent selection allows for making multiple selections that are all maintained - can SHIFT-SELECT or can make persistent using code
    • subplot(.) %>% hide_legend() %>% highlight(persistent=TRUE)
    • subplot(.) %>% hide_legend() %>% highlight(persistent=TRUE, dynamic=TRUE) # allows for users to select different colors for different selections
  • Can also use indirect manipulation, such as user-interaction with plotly by way of a dropdown
    • subplot(.) %>% hide_legend() %>% highlight(selectize=TRUE)

Making Shinier Charts:

  • Can create highly interactive charts similar to shiny using plotly
    • shared_data <- SharedData$new(world2014)
    • p1 <- shared_data %>% plot_ly(.) %>% .
    • p2 <- shared_data %>% plot_ly(.) %>% .
    • bscols(p1, p2) # bscols creates two side-by-side HTML widgets (each has its own control panel)
  • Can modify the bscols() call to add filters
    • bscols(filter_checkbox(id="four_regions", label="region", sharedData=shared_data, group=~four_regions), p1)
    • bscols(filter_select(id="four_regions", label="region", sharedData=shared_data, column=~four_regions), p1) # drop-downs
    • bscols(filter_slider(id="co2", label="CO2 Concentrations", sharedData=shared_data, column=~co2), p1) # sliders
  • Can fix the axes so they do not scale with data selections
    • bscols(filter_slider(id="co2", label="CO2 Concentrations", sharedData=shared_data, column=~co2), p1 %>% layout(xaxis=list(range=c(2.5, 5)), yaxis=list(range=c(-1.4, 1.55)))) # fixed axis ranges
  • Can put the pieces together in a single call
    • bscols(widths=c(2, 5, 5), list(filter_checkbox(.), filter_slider(.)), p1, p2)

Example code includes:

us2017 <- us_economy %>% 
    filter(year == 2017) %>% 
    group_by(state, year, gdp, home_owners, house_price, population, region, division) %>%
    summarize(employment=mean(employment)) %>%
    ungroup()
str(us2017, give.attr=FALSE)


# Create a SharedData object from us2017
shared_us <- crosstalk::SharedData$new(us2017)

# Create a scatterplot of house_price vs. home_owners
p1 <- shared_us %>%
    plot_ly(x = ~home_owners, y = ~house_price) %>%
    add_markers()
 
# Scatterplot of house_price vs. employment rate
p2 <- shared_us %>%
    plot_ly(x = ~employment/population, y = ~house_price) %>%
    add_markers()
  
# Polish the linked scatterplots
linked_plots <- subplot(p1, p2, titleX = TRUE, shareY = TRUE) %>% hide_legend()
linked_plots


# Add a highlight layer
linked_plots %>% 
    highlight()

# Enable linked brushing
linked_plots %>% 
    highlight(on = "plotly_selected")

# Enable hover highlighting
linked_plots %>% 
    highlight(on = "plotly_hover")


# Create a shared data object keyed by individual states
state_data <- us_economy %>%
    crosstalk::SharedData$new(key=~state)

# Using the shared data, plot house price vs. year
state_data %>%
    plot_ly(x=~year, y=~house_price) %>%
    # Group by state
    group_by(state) %>%
    # Add lines
    add_lines()


# Create a shared data object keyed by region
shared_region <- us_economy %>% 
    crosstalk::SharedData$new(key = ~region)

# Create a dotplot of avg house_price by region in 2017
dp_chart <- shared_region %>%
    plot_ly() %>%
    filter(year == 2017) %>%
    group_by(region) %>%
    summarize(avg.hpi = mean(house_price, na.rm = TRUE)) %>%
    add_markers(x = ~avg.hpi, y = ~region)
  
# Code for time series plot
ts_chart <- shared_region %>%
    plot_ly(x = ~year, y = ~house_price) %>%
    group_by(state) %>%
    add_lines()
  
# Link dp_chart and ts_chart
subplot(dp_chart, ts_chart)


# Create a shared data object keyed by division
shared_region <- crosstalk::SharedData$new(us2017, key = ~division)

# Create a bar chart for division
bc <- shared_region %>%
    plot_ly() %>%
    count(division) %>%
    add_bars(x = ~division, y = ~n) %>%
    layout(barmode = "overlay")
  
# Bubble chart
bubble <- shared_region %>%
    plot_ly(x = ~home_owners, y = ~house_price, hoverinfo = "text", text = ~state) %>%
    add_markers(size = ~population, marker = list(sizemode = "diameter"))

# Remove the legend
subplot(bc, bubble) %>% hide_legend()


# Enable persistent hover selection and a color selector 
linked_plots %>%
    highlight(persistent = TRUE, selectize = TRUE, dynamic = TRUE, on="plotly_hover")


# Create a shared data object keyed by state
state_data <- crosstalk::SharedData$new(us2017, key = ~state, group = "Select a state")

# Enable indirect selection by state
state_data %>%
    plot_ly(x = ~home_owners, y = ~house_price, hoverinfo = "text", text = ~state) %>%
    add_markers(size = ~population, marker = list(sizemode = "diameter")) %>%
    highlight(selectize = TRUE)

# Create a shared data object keyed by region
region_data <- crosstalk::SharedData$new(us2017, key = ~region, group = "Select a region")

# Enable indirect selection by region
region_data %>%
    plot_ly(x = ~home_owners, y = ~house_price, hoverinfo = "text", text = ~state) %>%
    add_markers(size = ~population, marker = list(sizemode = "diameter")) %>%
    highlight(selectize = TRUE)


# Create a row of subplots containing p97, p07, and p17 with widths 6, 3, 3
# crosstalk::bscols(widths=c(6, 3, 3) , p97, p07, p17)

# Specify that p07 should span 5 columns
# crosstalk::bscols(widths=c(NA, 5, NA), p97, p07, p17)

# Stack p07 and p17 in the right column
# crosstalk::bscols(p97, list(p07, p17))


# shared data object
shared_us <- crosstalk::SharedData$new(us2017, key = ~region)

# scatterplot of housing price index against homeownership
p17 <- shared_us %>%
    plot_ly(x = ~home_owners, y = ~house_price, color = ~region, height = 400) %>%
    add_markers()
  
# add a column of checkboxes for region to the left of the plot
crosstalk::bscols(widths=c(3, NA),
                  crosstalk::filter_checkbox(id = "region", label = "Region", 
                                             sharedData = shared_us, group = ~region
                                             ), p17
                  )


shared_us <- crosstalk::SharedData$new(us2017)

p17 <- shared_us %>%
    plot_ly(x = ~home_owners, y = ~house_price, color = ~region, height = 400) %>%
    add_markers() %>%
    layout(xaxis = list(title = "Home ownership (%)"), yaxis = list(title = "HPI"))
  
# add a slider filter for each axis below the scatterplot
crosstalk::bscols(list(p17,
                       crosstalk::filter_slider(id = "price", label = "HPI", 
                                                sharedData = shared_us, column = ~house_price
                                                ),
                       crosstalk::filter_slider(id = "owners", label = "Home ownership (%)", 
                                                sharedData = shared_us, column = ~home_owners
                                                )
                       )
                  )

Chapter 4 - Case Study: Space Launches

Introduction to the Data:

  • Government and private-company space launches, stored in the 5,726 × 11 launches dataset
  • Can explore the “space race” data by government, as well as by private company

Recap: Animation:

  • Can run cumulative animations using accumulate() and bind_rows() as per Chapter 2
  • Need to ensure a common baseline, even if data did not exist for all elements at that time
    • complete_logs <- monthly_logs %>% complete(package, dec_date, fill=list(downloads=0)) # default for fill is NA

Recap: Linked Views and Selector Widgets:

  • The SharedData$new(myData) command allows for sharing data across charts
  • Can use either subplot() or bscols() to create multiple charts
    • Should use highlight() also to enable linked brushing
  • Will use the launch vehicles dataset, focused on a subset of length/diameter, payload capacity to LEO, and total thrust

Wrap Up:

  • Animating plotly charts
    • Fundamentals of plotly, and polishing
    • Animating plotly charts - frame, ids, layer
    • crosstalk::SharedData$new() for linking plotly charts
  • Can extend capabilities, including
    • leaflet, highcharter, trelliscope, rbokeh, etc.
    • Shiny
    • plotly for R

Example code includes:

# table of launches by year
launches_by_year <- launches %>%
    count(launch_year)

# create a line chart of launches over time
launches_by_year %>%
    plot_ly(x=~launch_year, y=~n) %>%
    add_lines() %>%
    layout(xaxis = list(title = "Year"), yaxis = list(title = "Launches"))

# create a filled area chart of launches over time
launches_by_year %>%
    plot_ly(x=~launch_year, y=~n) %>%
    add_lines(fill="tozeroy") %>%
    layout(xaxis = list(title = "Year"), yaxis = list(title = "Launches"))

# create a bar chart of launches over time
launches_by_year %>%
    plot_ly(x=~launch_year, y=~n) %>%
    add_bars() %>%
    layout(xaxis = list(title = "Year"), yaxis = list(title = "Launches"))


# table of launches by year
state_launches <- launches %>% 
    filter(agency_type == "state") %>% 
    count(launch_year, state_code)

# create a ShareData object for plotting
shared_launches <- state_launches %>% 
    crosstalk::SharedData$new(key = ~state_code)

# Create a line chart for launches by state, with highlighting
shared_launches %>%
    plot_ly(x=~launch_year, y=~n, color=~state_code) %>%
    add_lines() %>%
    highlight()


# table of launches by year and agency type
launches_by_year <- launches %>% 
    count(launch_year, agency_type)

# create a ShareData object for plotting
shared_launches <- launches_by_year %>% 
    crosstalk::SharedData$new(key = ~agency_type)

# create a line chart displaying launches by agency type, with highlighting
shared_launches %>%
    plot_ly(x=~launch_year, y=~n, color=~agency_type) %>%
    add_lines() %>%
    highlight()


# Complete the state_launches data set
annual_launches <- launches %>% 
    filter(agency_type == "state") %>%
    count(launch_year, state_code) %>%
    tidyr::complete(state_code, launch_year, fill = list(n = 0))

# Create the cumulative data set
cumulative_launches <- annual_launches %>%
    split(f = .$launch_year) %>%
    accumulate(., ~bind_rows(.x, .y)) %>%
    bind_rows(.id = "frame")

# Create the cumulative animation
cumulative_launches %>%
    plot_ly(x = ~launch_year, y = ~n) %>%
    add_lines(color = ~state_code, frame = ~frame, ids = ~state_code)


# Complete the private_launches data set
annual_launches <- launches %>% 
    filter(agency_type == "private") %>%
    rename(year=launch_year, agency_name=agency) %>%
    count(year, agency_name) %>%
    tidyr::complete(agency_name, year, fill = list(n = 0))

# Create the cumulative data set
cumulative_launches <- annual_launches %>%
    split(f = .$year) %>%
    accumulate(., ~bind_rows(.x, .y)) %>%
    bind_rows(.id = "frame")

# Create the cumulative animation
cumulative_launches %>%
    plot_ly(x = ~year, y = ~n, color = ~agency_name) %>%
    add_lines(frame = ~frame, ids = ~agency_name)


# Create a SharedData object allowing selection by year
# shared_year <- crosstalk::SharedData$new(launches, key = ~launch_year)

# Create a bar chart of launches by year
# bar <- shared_year %>%
#     plot_ly(x = ~launch_year, y = ~n) %>%
#     count(launch_year) %>%
#     add_bars() %>%
#     layout(barmode = "overlay") %>%
#     highlight()

# Create a scatterplot of diameter vs. length
# scatter <- shared_year %>%
#     plot_ly(x = ~length, y = ~diameter) %>%
#     add_markers() %>%
#     highlight()

# Use bscols to link the two charts
# crosstalk::bscols(bar, scatter)


# Create a SharedData object allowing selection of observations
# shared_obs <- crosstalk::SharedData$new(lv2000)

# Create a scatterplot of to_thrust against leo_capacity
# p1 <- shared_obs %>%
#     plot_ly(x = ~leo_capacity, y = ~to_thrust) %>%
#     add_markers() 

# Scatterplot of diameter vs. length
# p2 <- shared_obs %>%
#     plot_ly(x = ~length, y = ~diameter) %>%
#     add_markers() 

# Link p1 and p2
# subplot(p1, p2) %>%
#     highlight(on = "plotly_selected", off = "plotly_deselect") %>%
#     hide_legend()


# SharedData object allowing selection of observations
# shared_obs <- crosstalk::SharedData$new(lv2000)

# Scatterplot of to_thrust against leo_capacity
# scatter <- shared_obs %>%
#     plot_ly(x = ~leo_capacity, y = ~to_thrust) %>%
#     add_markers()

# Create a histogram of to_thrust
# histo <- shared_obs %>%
#     plot_ly(x = ~to_thrust) %>%
#     add_histogram(name = "overall")

# Link the two plots
# subplot(scatter, histo) %>%
#     hide_legend() %>%
#     highlight(on = "plotly_selected")


# Create a SharedData object containing the number of launches by year and state
shared_launches <- launches %>% 
    filter(agency_type == "state") %>%
    count(state_code, launch_year) %>%
    crosstalk::SharedData$new()

# Create a line chart displaying the launches by state
launch_ts <- shared_launches %>%
    plot_ly(x = ~launch_year, y = ~n, color = ~state_code) %>%
    add_lines()  

# Add a slider below the chart to filter the years displayed
crosstalk::bscols(list(launch_ts, 
                       crosstalk::filter_slider(id = "time", label = "Year", 
                                                sharedData = shared_launches, column = ~launch_year
                                                )
                       )
                  )

Defensive R Programming

Chapter 1 - Avoiding Conflict

Defensive R Programming:

  • Course goals are to make robust data processing pipelines
    • Automate to reduce problems, and learn to diagnose and fix problems that occur

Avoid Reinventing the Wheel:

  • There are many freely available packages for R
  • Many factors in choosing a package
    • Mature?
    • Actively developed?
    • Documentation and vignettes?
    • Common usage?
  • Can look at the “CRAN Task View” page to see a human-curated summary of appropriate packages by application

Packages and Namespaces:

  • The ls() function will reveal the contents of the current environment
    • Can quickly lose track of what is going on when there are many objects
  • Packages use namespaces, or containers for all of the functions in the package
    • library(myPackage) provides access to all the functions in the namespace
    • getNamespaceExports("dplyr") # returns the names of all exported dplyr functions
    • dplyr::filter # use the filter() function from the dplyr namespace
  • When a function is typed without a namespace, the first namespace containing the function is used
    • The global environment is always first
    • Then, packages go in load order (most recent packages first)
  • The conflicted package makes this easier by identifying conflicts and allowing for explicit resolutions
    • library("conflicted")
    • library("dplyr")
    • filter # will raise an error noting the conflict
    • conflict_prefer("filter", "dplyr") # sets the default to dplyr::filter

Example code includes:

# Create a data frame of the packages where a newer version is available
old <- old.packages()

# Find how many packages need to be updated
no_of_old_pkgs <- nrow(old)


# Count the number of functions in ggplot2
no_of_functions <- length(getNamespaceExports("ggplot2"))


# Load the dplyr and conflicted packages
library("dplyr")
library("conflicted")

# Prefer the dplyr version of the lag function
conflict_prefer("lag", "dplyr")

# This should return NA, 1, 2, 3
lag(1:4)

Chapter 2 - Early Warning Systems

Early Warning Systems:

  • Early warnings help with both problem avoidance and problem handling
  • Shortcuts such as T / F for TRUE / FALSE can be risky
    • T <- 5 is not protected
    • TRUE <- 5 cannot happen (protected against overwriting)
  • When testing for logical problems, use isTRUE() for protection against numerical values
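
A minimal sketch of the T/F pitfall and the isTRUE() guard described above:

```r
# T is just a variable bound to TRUE, so it can be silently overwritten;
# the reserved word TRUE cannot be reassigned
T <- 5                      # allowed - T is not protected
# TRUE <- 5                 # error: invalid assignment target

# isTRUE() protects conditions against non-logical or vector values
isTRUE(TRUE)                # TRUE
isTRUE(1)                   # FALSE - numeric 1 is not literally TRUE
isTRUE(c(TRUE, TRUE))       # FALSE - must be a length-one logical

rm(T)                       # restore the usual binding to TRUE
```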

Message in a Bottle:

  • The message() function provides information to a user
  • Can suppress messages by wrapping calls in suppressMessages(f(.))
    • suppressPackageStartupMessages(library(.))

Warning:

  • The warning() function provides a message while signaling that something may have gone wrong
  • Can suppress using suppressWarnings()
    • It is almost always better to address the underlying issues rather than suppress the warnings

Stop:

  • The stop() function flags an error and stops the code
    • stop("Insert message here")
  • Errors cannot be suppressed, since something has gone wrong and must be fixed
  • The try() function is a bit like suppressWarnings(), but it also captures the error and allows execution to continue
    • res <- try("a" + "b", silent=TRUE)
    • res # a try-error object describing the problem
    • class(res) # will be "try-error" if something has gone wrong

Example code includes:

# Define F to be interpreted as TRUE
F <- TRUE

# Read in data: don't alter the line below
data_set <- read.csv("iris.csv", header = F)


suppressPackageStartupMessages(library("dplyr"))


# Suppress the standard output of the simulate() function
sim = suppressMessages(simulate(runs = 5))


# Modify the function to make it less noisy
get_distribution <- function(N, verbose = TRUE) {
  results <- numeric(N)
  for(i in 1:N) {
    results[i] <- simulate()
    # Check if verbose is TRUE
    if(isTRUE(verbose)) {
      # Show a progress report
      message("Simulation ", i, " completed")
    }
  }
  return(results)
}


# Create new variable 
x <- c(1, 1, 1)
y <- 1:3

# Suppress the warning
m <- suppressWarnings(cor(x, y))


mean_age = function(ages) {
  if(any(ages < 0)) {
    stop("You have negative ages!")
  }
  # Stop the execution if any of the ages are over 150
  if(any(ages > 150)) {
    stop("You have ages over 150!")
  }
  m = mean(ages)
  return(m)
}

Chapter 3 - Preparing Defenses

Preparing Defenses:

  • DRY is the principle of "Don't Repeat Yourself" - avoid copy/paste, and reuse existing code
    • The copy and paste rule says that at most one copy/paste should be done
    • Should typically use functions or for loops rather than multiple copy/paste

Comments:

  • Code can be less obvious over time, even to its own author
    • Avoid obvious comments that merely describe a common action
    • Avoid comments that will not be updated - bad comments are worse than no comments
    • Be consistent - have a style of commenting, and stay with it
    • Follow the general rules of grammar, including capitalization
    • Be very careful with jokes or other comments that could cause offense

Dotty:

  • The full stop (dot) in R has a very special meaning - it is used for method dispatch in S3 object-oriented programming
    • summary(m) runs as summary.lm(m) if class(m) is "lm"
    • Generally agreed rule that dots should not be used in function or variable names
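
A minimal sketch of S3 dispatch; the myclass example is an illustrative assumption:

```r
# summary(m) finds summary.lm() because class(m) is "lm"
m <- lm(mpg ~ wt, data = mtcars)
class(m)                    # "lm"

# A generic selects its method by pasting the class name onto its own,
# which is why dots in function names can be mistaken for S3 methods
print.myclass <- function(x, ...) cat("An object of myclass\n")
obj <- structure(list(), class = "myclass")
print(obj)                  # dispatches to print.myclass()
```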

Coding Style:

  • Consistency is key, particularly for the same piece of code and/or collaborators
    • Do not mix and match <- and =
    • Consistent spacing is easy to implement and improves readability
    • Use spaces around assignment operators
    • Add spaces after a comma

Static Code Analysis for R:

  • Static code analysis is the computer checking the code
  • The lintr package checks for syntax and/or semantic issues
    • lintr::lint("myFileName.R")

Example code includes:

m <- mean(x)
s <- sd(x)
n <- length(x)
c(m - 1.96 * s/sqrt(n), m + 1.96 * s/sqrt(n))

m <- mean(y)
s <- sd(y)
n <- length(y)
c(m - 1.96 * s/sqrt(n), m + 1.96 * s/sqrt(n))

# Define a function to prevent pasting the code above
ci <- function(values) {
    n <- length(values)
    m <- mean(values) 
    s <- sd(values) 
    c(m - 1.96*s/sqrt(n), m + 1.96*s/sqrt(n))
}


# Define a function to prevent pasting the code above
ci <- function(x, plot_it = FALSE) {
    # Plot the data
    if (isTRUE(plot_it)) hist(x)
    m <- mean(x)
    s <- sd(x)
    n <- length(x)
    c(m - 1.96 * s/sqrt(n), m + 1.96 * s/sqrt(n))
}

# Generate 100 normal random numbers
sample_values <- rnorm(100)
ci(sample_values)


# Fix the code
f <- function(x, y, z) {
    x + y + z
}


# Fix the code
summarise_data <- function(values){
    c(mean(values), median(values))
}

stats <- summarise_data(runif(10))

Chapter 4 - Creating a Battle Plan

Battle Plan:

  • Building work in small, modular chunks allows for better QC and for building on past work
  • Sensible filenames help with organizing and maintaining code
    • Do not use spaces in filenames - "always a bad idea"
    • Underscores have a few small problems - e.g., some tools (such as LaTeX) treat _ as a special character
    • Generally a best practice to use dashes rather than underscores

Human Readable Filenames:

  • URL slugs are the ending portion of a web address
  • Good filenames should be similar to good slugs - more descriptive is generally better
  • Dates should be unambiguous and nicely sortable
    • ISO8601 is YYYY-MM-DD
  • Numbers in filenames or directories can be handy
    • Using 01 is often better than 1, since sorting will still work for 10-99 files
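
A quick sketch of why zero-padding and ISO 8601 dates keep filenames sortable; the filename patterns below are illustrative:

```r
# Zero-padded numbers keep default (lexicographic) sorting consistent
# with numeric order once there are 10+ files
padded <- sprintf("%02d-report.csv", c(1, 2, 10))
sort(padded)    # "01-report.csv" "02-report.csv" "10-report.csv"

# ISO 8601 (YYYY-MM-DD) dates also sort chronologically as plain text
dates <- format(Sys.Date() + 0:2, "%Y-%m-%d")
paste0(dates, "-analysis.R")
```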

Organizing Projects:

  • Code often starts with small code, and expands (and becomes more messy) over time
  • Can mitigate issues by having every project in its own directory
    • Can then put the same structure inside every directory
    • Data edited only inside R
    • Can have a directory with the boring name R, containing the file load.R (which loads what is needed from the input/ directory)
    • Can have standard names like clean.R, analysis.R, etc.

Graphics and Output:

  • Can add directories output and graphics, each generated by the scripts in R

Example code includes:

# The load.R file
library("readr")
library("readxl")

# Print the current working directory
getwd()

# Change to relative paths
battles <- read_csv("input/battles.csv")
foes <- read_xlsx("input/foes.xlsx")


library("ggplot2")
library("readr")

# Show the file/directory structure
list.files(".")

# Change to relative paths and load the data
battles <- read_csv("input/battles.csv")
g <- ggplot(battles) + 
    geom_bar(aes(Location)) # Bar chart

# Change to relative paths and save the data
ggsave(filename = "graphics/locations.pdf", plot = g)


# The load.R file
library("readr")
library("readxl")

# Change to relative paths
battles <- read_csv("input/battles.csv")
foes <- read_xlsx("input/foes.xlsx")

library("ggplot2")
library("readr")
source("R/load.R")

# Create a bar chart
g <- ggplot(battles) + 
    geom_bar(aes(Location))

# Change to relative paths
ggsave(filename = "graphics/locations.pdf", plot = g)

Feature Engineering in R

Chapter 1 - Creating Features from Categorical Data

Introduction to Feature Engineering in R:

  • Feature engineering is the process of modifying and combining input features to best solve the analytical problem
  • Modeling processes tend to be iterative, with a mix of exploring, refining, and repeating
  • Categorical variables often need to be encoded as factors (numbers) for machine interpretability
  • Can use the caret package to create dummy variables automatically
    • caret::dummyVars("~myVar", data=myData)
  • Can run One Hot Encoding through a series of ifelse statements inside mutate
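
A minimal base-R sketch of the ifelse approach (the data frame and column names are illustrative; caret::dummyVars automates this across many columns):

```r
# One-hot encode a two-level factor by hand with ifelse()
students <- data.frame(gender = factor(c("Female", "Male", "Male", "Female")))
students$male   <- ifelse(students$gender == "Male", 1, 0)
students$female <- ifelse(students$gender == "Female", 1, 0)
students
```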

Binning Encoding: Content Driven:

  • Can use select() and table() to find the categories and counts
  • The case_when approach is valuable when combining multiple categories into larger groups
    • mutate(new_var = case_when(a == "b" ~ "11", a == "c" ~ "11", a == "d" ~ "12", ..., TRUE ~ as.character(a))) # TRUE catches everything not matched above

Binning Encoding: Data Driven:

  • May need to bin even where there are many potential categories; binning gives more meaningful features than One Hot Encoding every level
    • prop_results <- prop.table(myTable, 1) # the 1 means proportions within each row
    • prop_results %>% filter(myKeyCol) %>% arrange(n) # sort from low to high
  • Can then use the results to choose expert cut points for creating a few bins

Example code includes:

id <- c(3410, 9157, 2250, 2353, 4872, 2929, 4077, 1351, 9596, 4157, 3536, 5742, 5183, 2432, 8359, 3055, 6889, 2850, 6975, 7697, 694, 8580, 7161, 3183, 6188, 8429, 7552, 5562, 5455, 5354, 4126, 3246, 2033, 4444, 3875, 2926, 648, 9110, 546, 1871, 4340, 8645, 5766, 189, 6321, 3414, 5686, 2863, 5853, 9226, 655, 2966, 1176, 3817, 1070, 7258, 7871, 7581, 8098, 1123, 7106, 9532, 8566, 4229, 4285, 544, 407, 8262, 6232, 7946, 6995, 8134, 1512, 2302, 2034, 9555, 9467, 7627, 4558, 7661, 7650, 4326, 8771, 8147, 6613, 5080, 9968, 9908, 4585, 3751, 9588, 8491, 84, 6731, 3652, 5449, 4482, 587, 2620, 5655, 9915, 4243, 8141, 8934, 1908, 4480, 2320, 1612, 9315, 8035, 8970, 6039, 1202, 9223, 9879, 2456, 1633, 2798, 4435, 4774, 7706, 6378, 8537, 1235, 1663, 8698, 2397, 951, 8753, 9802, 4504, 426, 1482, 1157, 8866, 9678, 3122, 5903, 1252, 9669, 6233, 5463, 2952, 8259, 1162, 1006, 5625, 8526, 9584, 9177, 5592, 1489, 6035, 223, 62, 751, 393, 2471, 4813, 7462, 1809, 5089, 2290, 2020, 7408, 7338, 1214, 3690, 3715, 250, 2968, 6546, 2300, 8835, 5976, 2783, 3292, 7578, 5519, 3250, 8831, 2278, 9783, 4970, 2446, 5813, 6017, 8048, 4628, 5550, 4995, 636, 6238, 4767, 8013, 4601, 8460, 9929, 9592, 5148, 4215, 8516, 9725, 6753, 3952, 6038, 2682, 5886, 630, 9006, 2710, 812, 1545, 5011, 7587, 7847, 5184, 5889, 7411, 9559, 7008, 5986, 811, 7515, 3084, 929, 6359, 4879, 144, 1712, 6758, 7142, 746, 4947, 3673, 9261, 2745, 9755, 8489, 4073, 8397, 1297, 2407, 4080, 2713, 5275, 8447, 192, 5542, 1927)
grade <- c(5, 12, 3, 3, 6, 4, 5, 2, 12, 6, 5, 7, 7, 3, 11, 4, 9, 4, 9, 10, 1, 11, 9, 4, 8, 11, 10, 7, 7, 7, 6, 4, 3, 6, 5, 4, 1, 11, 1, 3, 6, 11, 8, 1, 8, 5, 7, 4, 8, 12, 1, 4, 2, 5, 2, 9, 10, 10, 10, 2, 9, 12, 11, 6, 6, 1, 1, 10, 8, 10, 9, 10, 2, 3, 3, 12, 12, 10, 6, 10, 10, 6, 11, 10, 8, 7, 12, 12, 6, 5, 12, 11, 1, 9, 5, 7, 6, 1, 4, 7, 12, 6, 10, 11, 3, 6, 3, 2, 12, 10, 11, 8, 2, 12, 12, 3, 2, 4, 6, 6, 10, 8, 11, 2, 2, 11, 3, 2, 11, 12, 6, 1, 2, 2, 11, 12, 4, 8, 2, 12, 8, 7, 4, 10, 2, 2, 7, 11, 12, 12, 7, 2, 8, 1, 1, 1, 1, 3, 6, 9, 3, 7, 3, 3, 9, 9, 2, 5, 5, 1, 4, 8, 3, 11, 8, 4, 4, 10, 7, 4, 11, 3, 12, 7, 3, 8, 8, 10, 6, 7, 7, 1, 8, 6, 10, 6, 11, 12, 12, 7, 6, 11, 12, 9, 5, 8, 4, 8, 1, 11, 4, 1, 2, 7, 10, 10, 7, 8, 9, 12, 9, 8, 1, 10, 4, 2, 8, 6, 1, 3, 9, 9, 1, 7, 5, 12, 4, 12, 11, 5, 11, 2, 3, 5, 4, 7, 11, 1, 7, 3)
gender <- factor(c('Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 
'Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male'))
discipline <- c(0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0)
infraction <- c('academic dishonesty', 'disruptive conduct', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'failure to cooperate', 'fighting', 'disruptive conduct', 'fighting', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'plagiarism', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'fighting', 'failure to cooperate', 'alcohol', 'fighting', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'fighting', 'failure to cooperate', 'failure to cooperate', 'fighting', 'failure to cooperate', 'failure to cooperate', 'fighting', 'disruptive conduct', 'failure to cooperate', 'academic dishonesty', 'academic dishonesty', 'disruptive conduct', 'alcohol', 'failure to cooperate', 'disruptive conduct', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'fighting', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'fighting', 'vandalism', 'failure to cooperate', 'minor incident', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'academic dishonesty', 'disruptive conduct', 'plagiarism', 'academic dishonesty', 'fighting', 'failure to cooperate', 'minor incident', 'academic dishonesty', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'minor incident', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'fighting', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'academic dishonesty', 'disruptive conduct', 'plagiarism', 
'disruptive conduct', 'alcohol', 'failure to cooperate', 'fighting', 'fighting', 'disruptive conduct', 'alcohol', 'academic dishonesty', 'failure to cooperate', 'alcohol', 'fighting', 'failure to cooperate', 'failure to cooperate', 'vandalism', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'disruptive conduct', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'academic dishonesty')
infraction <- c(infraction, 'fighting', 'failure to cooperate', 'failure to cooperate', 'minor incident', 'failure to cooperate', 'disruptive conduct', 'alcohol', 'fighting', 'academic dishonesty', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'fighting', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'academic dishonesty', 'vandalism', 'failure to cooperate', 'fighting', 'failure to cooperate', 'disruptive conduct', 'failure to cooperate', 'alcohol', 'fighting', 'disruptive conduct', 'fighting', 'disruptive conduct', 'academic dishonesty', 'disruptive conduct', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'disruptive conduct', 'fighting', 'fighting', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'fighting', 'fighting', 'disruptive conduct', 'academic dishonesty', 'alcohol', 'disruptive conduct', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'minor incident', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'fighting', 'fighting', 'vandalism', 'fighting', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'vandalism', 'fighting', 'minor incident', 'plagiarism', 'minor incident', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'alcohol', 'academic dishonesty', 'alcohol', 'fighting', 'fighting', 'alcohol', 'failure to cooperate', 'minor incident', 'alcohol', 'fighting', 'failure to cooperate', 'academic dishonesty', 'fighting', 'failure to cooperate', 'disruptive conduct', 'failure to cooperate', 'disruptive conduct', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'fighting', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 
'plagiarism', 'fighting', 'academic dishonesty', 'disruptive conduct', 'academic dishonesty', 'academic dishonesty', 'vandalism', 'failure to cooperate', 'fighting', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'failure to cooperate', 'failure to cooperate', 'alcohol', 'disruptive conduct', 'failure to cooperate', 'academic dishonesty', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'failure to cooperate', 'fighting', 'alcohol', 'academic dishonesty', 'failure to cooperate')
times <- c(1473938153, 1483110896, 1489060088, 1493899994, 1493909505, 1483614394, 1492518486, 1494502453, 1495209565, 1494590293, 1483704752, 1476705077, 1481545093, 1489416862, 1493390730, 1478088653, 1476705910, 1485259499, 1480431250, 1492614726, 1486539571, 1478700385, 1484581056, 1482755357, 1483446278, 1482247343, 1474383858, 1473682392, 1481283681, 1485173234, 1489503848, 1477492361, 1485951473, 1491562724, 1477914755, 1491835059, 1484295102, 1478011951, 1478592346, 1482493816, 1486468634, 1475075730, 1481633206, 1478765263, 1485173006, 1487245747, 1477395294, 1490617068, 1491828436, 1484238621, 1476085966, 1492778297, 1491910329, 1479739916, 1484653465, 1485788584, 1495206822, 1482762803, 1480604700, 1490701662, 1475679546, 1491923600, 1482503451, 1474974934, 1494851233, 1487058134, 1489389522, 1490191070, 1484743053, 1491922384, 1477492959, 1487602919, 1494850933, 1495717272, 1487330996, 1495726535, 1478531164, 1488290353, 1472818971, 1490193538, 1496153591, 1489582550, 1494339621, 1478530685, 1485432998, 1490715336, 1491234140, 1481122391, 1479987613, 1489577270, 1492183462, 1479223633, 1480577585, 1493810573, 1474978111, 1476282532, 1473941219, 1474360465, 1492516515, 1477999051, 1474470475, 1478780271, 1487256863, 1487079493, 1495538105, 1496156561, 1486468944, 1494246393, 1475072152, 1473090763, 1488381954, 1488540975, 1492688980, 1478012745, 1481643977, 1478085901, 1475582009, 1489578677, 1490357943, 1489579269, 1478186519, 1479470482, 1485959227, 1484049999, 1494329765, 1481296561, 1493900316, 1475654478, 1481207010, 1480430874, 1476445526, 1483949281, 1479731318, 1480592314, 1476369532, 1490193865, 1487334057, 1473941362, 1478090382, 1477666074, 1479471905, 1473421548, 1486555602, 1493304379, 1485346538, 1476952402, 1485428735, 1486566600, 1487253880, 1474554634, 1476707895, 1493987697, 1484567632, 1482305579, 1479974396, 1479369745, 1485933183, 1491392716, 1475582850, 1479393434, 1484061732, 1488541793, 1474460983, 1488540387, 1493908692, 
1475074857, 1486641048, 1494417673, 1491564900, 1481096843, 1487595204, 1493644827, 1476879918, 1480086060, 1493900260, 1490095752, 1496317517, 1478791768, 1493304308, 1485528923, 1483970062, 1473768259, 1490024346, 1489145829, 1494415660, 1483360369, 1481899428, 1490195704, 1474976116, 1478012482, 1487772608, 1495097185, 1493206536, 1484575342, 1478700512, 1479987287, 1492437830, 1483458545, 1480346620, 1493639689, 1485432736, 1477924233, 1491579799, 1479395782, 1476793778, 1493120617, 1476978304, 1496143032, 1482913710, 1489592461, 1479121859, 1478678802, 1486639883, 1485172303, 1486045483, 1485961556, 1473939375, 1493985609, 1475509067, 1485441623, 1491835910, 1491393930, 1480663866, 1479134060, 1492600360, 1485329189, 1494591915, 1476185444, 1493797262, 1489061537, 1482320494, 1484925313, 1476947699, 1480595285, 1485171879, 1479998225, 1485518119, 1481210950, 1496158904, 1478520382, 1487259327, 1476445306, 1486382276, 1485776699, 1475235680, 1493641377, 1477495455, 1487775512, 1492603708, 1479815502)
samps <- c(3230, 6560, 2134, 2228, 7145, 2773, 3865, 1280, 9692, 3935, 3348, 5202, 4754, 6900, 8771, 2893, 5992, 2698, 7739, 8273, 657, 8947, 7877, 3015, 5569, 8823, 6160, 5062, 4968, 4892, 6987, 6942, 1925, 4167, 3674, 6927, 615, 9339, 516, 1772, 4083, 6423, 5221, 180, 5675, 3234, 5159, 2711, 5296, 9419, 622, 2808, 1112, 6970, 1015, 6076, 8401, 8187, 8570, 1064, 7833, 6658, 8935, 3992, 4041, 514, 386, 8695, 5602, 8466, 7755, 8595, 1431, 2181, 1926, 9661, 6643, 6177, 4253, 8242, 8234, 4072, 6459, 6316, 5911, 7177, 9974, 6753, 4272, 3553, 9685, 8869, 79, 5951, 3457, 7257, 4197, 555, 2483, 5133, 9931, 4004, 8598, 9200, 1805, 7056, 2196, 1528, 9486, 8530, 6515, 5448, 1137, 6578, 6748, 2325, 1548, 2651, 4160, 4424, 8281, 5724, 8909, 1168, 1577, 9031, 2269, 901, 9070, 9844, 4209, 404, 1403, 1094, 9147, 9757, 2956, 5339, 1185, 9750, 5603, 4975, 2795, 8692, 1099, 953, 5106, 8901, 9682, 6566, 5083, 1410, 5445, 211, 59, 713, 372, 2340, 4455, 8101, 6866, 4681, 2171, 1912, 8065, 8017, 1148, 3494, 3519, 236, 2810, 5857, 2179, 9128, 5395, 2636, 3118, 8184, 7267, 6943, 9126, 2160, 6725, 4584, 2316, 5263, 7357, 8540, 4307, 7271, 7162, 604, 5606, 7120, 8517, 4287, 8846, 9942, 9689, 4727, 3980, 8891, 9787, 7568, 3747, 5447, 6912, 5325, 598, 9252, 2567, 771, 1464, 4619, 8193, 6237, 4755, 5327, 6115, 9663, 6015, 5404, 770, 8135, 2919, 881, 5706, 4504, 137, 1622, 5958, 7861, 708, 4564, 3478, 9445, 2601, 6719, 8867, 3861, 6371, 1227, 2279, 3868, 2570, 4833, 8836, 6781, 5043, 1824)

discipline_logs <- tibble::tibble(id, grade, discipline, gender, 
                                  infraction=factor(infraction), 
                                  timestamp=as.POSIXct(times, origin="1970-01-01", tz="UTC")
                                  )
glimpse(discipline_logs)
## Observations: 250
## Variables: 6
## $ id         <dbl> 3410, 9157, 2250, 2353, 4872, 2929, 4077, 1351, 959...
## $ grade      <dbl> 5, 12, 3, 3, 6, 4, 5, 2, 12, 6, 5, 7, 7, 3, 11, 4, ...
## $ discipline <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, ...
## $ gender     <fct> Female, Female, Female, Female, Male, Male, Male, F...
## $ infraction <fct> academic dishonesty, disruptive conduct, failure to...
## $ timestamp  <dttm> 2016-09-15 11:15:53, 2016-12-30 15:14:56, 2017-03-...
discipline_logs <- discipline_logs %>%
    mutate(male = ifelse(gender == "Male", 1, 0), female = ifelse(gender == "Female", 1, 0))


# Create a new column with the proper string encodings
discipline_logs_new <-  discipline_logs %>%
    mutate(school_type = case_when(grade >= 1 & grade <= 5 ~ "elementary_school",
                                   grade >= 6 & grade <= 8 ~ "middle_school",
                                   grade <= 12 & grade >= 9 ~ "high_school"
                                   )
           )

# Look at a table of the new column 
discipline_logs_new %>% 
    select(school_type) %>% 
    table()
## .
## elementary_school       high_school     middle_school 
##                99                85                66
discipline_logs_new <- discipline_logs_new %>%
    mutate(elem_sch = ifelse(school_type == "elementary_school", 1, 0),
           mid_sch = ifelse(school_type == "middle_school", 1, 0),
           high_sch = ifelse(school_type == "high_school", 1, 0)
           )


# Create a table of the frequencies
discipline_table <- discipline_logs %>% 
    select(grade, discipline) %>% 
    table()

# Create a table of the proportions
prop_table <- prop.table(discipline_table, 1)


dgr_prop <- discipline_logs %>%
    group_by(grade) %>%
    summarize(proportion=mean(discipline))
dgr_prop
## # A tibble: 12 x 2
##    grade proportion
##    <dbl>      <dbl>
##  1     1     0.0455
##  2     2     0     
##  3     3     0.0952
##  4     4     0.190 
##  5     5     0.0714
##  6     6     0.182 
##  7     7     0.217 
##  8     8     0.0476
##  9     9     0.6   
## 10    10     0.818 
## 11    11     0.826 
## 12    12     0.64
# Combine the proportions and discipline logs data
discipline <- inner_join(discipline_logs, dgr_prop, by = "grade")

# Display a glimpse of the new data frame
glimpse(discipline)
## Observations: 250
## Variables: 9
## $ id         <dbl> 3410, 9157, 2250, 2353, 4872, 2929, 4077, 1351, 959...
## $ grade      <dbl> 5, 12, 3, 3, 6, 4, 5, 2, 12, 6, 5, 7, 7, 3, 11, 4, ...
## $ discipline <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, ...
## $ gender     <fct> Female, Female, Female, Female, Male, Male, Male, F...
## $ infraction <fct> academic dishonesty, disruptive conduct, failure to...
## $ timestamp  <dttm> 2016-09-15 11:15:53, 2016-12-30 15:14:56, 2017-03-...
## $ male       <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ female     <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, ...
## $ proportion <dbl> 0.0714, 0.6400, 0.0952, 0.0952, 0.1818, 0.1905, 0.0...
# Create a new column with three levels using the proportions as ranges
discipline_ed <- discipline %>%
   mutate(education_levels = case_when(proportion >= 0 & proportion <= .20 ~ "low_grade",
                                       proportion > .20 & proportion <= .25 ~ "middle_grade", 
                                       proportion > .25 & proportion <= 1 ~ "high_grade"
                                       )
          )

glimpse(discipline_ed)
## Observations: 250
## Variables: 10
## $ id               <dbl> 3410, 9157, 2250, 2353, 4872, 2929, 4077, 135...
## $ grade            <dbl> 5, 12, 3, 3, 6, 4, 5, 2, 12, 6, 5, 7, 7, 3, 1...
## $ discipline       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, ...
## $ gender           <fct> Female, Female, Female, Female, Male, Male, M...
## $ infraction       <fct> academic dishonesty, disruptive conduct, fail...
## $ timestamp        <dttm> 2016-09-15 11:15:53, 2016-12-30 15:14:56, 20...
## $ male             <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, ...
## $ female           <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, ...
## $ proportion       <dbl> 0.0714, 0.6400, 0.0952, 0.0952, 0.1818, 0.190...
## $ education_levels <chr> "low_grade", "high_grade", "low_grade", "low_...

Chapter 2 - Creating Features from Numeric Data

Numerical Bucketing or Binning:

  • Bins can work better than raw numeric values when there are non-linear relationships among the variables
  • Can bucket into equal bins, either equal-width (by variable range) or equal-frequency (by quantile)
    • myDF %>% mutate(myVar_cat = cut(myVar, breaks=seq(myLow, myHigh, by=myBy)))
    • model.matrix(myResponse ~ myVar_cat - 1, data=myDF) # the -1 means 'less than full rank', i.e. include all the categories

Binning Numerical Data Using Quantiles:

  • Can use quantile bucketing to bin into equal-frequency groups
    • myDF %>% mutate(myQ = ntile(myVar, 5)) # will create 5 equally sized buckets numbered 1:5

Date and Time Feature Extraction

  • The lubridate package from the tidyverse makes working with dates and times much easier
    • mdy_hm()
    • ymd_hm()
    • ymd_hms()
  • Can extract information from dates using lubridate
    • wday() # 1 is Sunday, 7 is Saturday
    • wday(myDate, label=TRUE) # label with the 3-character abbreviation
    • lubridate::hour()

Example code includes:

samps <- c(1115, 3501, 3729, 3314, 1269, 3992, 443, 643, 3509, 3795, 3020, 1011, 8, 696, 419, 2159, 228, 4319, 1154, 4927, 1477, 4127, 4413, 2767, 2156, 2242, 3810, 1430, 2695, 4422, 2168, 1348, 1642, 2315, 4576, 400, 4062, 4273, 848, 3838, 1162, 2721, 3860, 3873, 922, 2844, 1182, 2907, 1759, 184, 227, 3407, 3571, 4852, 3487, 4254, 2455, 430, 15, 3942, 4267, 2497, 45, 2213, 4114, 1807, 3107, 1095, 3391, 1889, 2642, 2246, 101, 4134, 3057, 3603, 2847, 3980, 3784, 195, 645, 4762, 1321, 2524, 2496, 1388, 12, 292, 2881, 3214, 216, 3982, 67, 2348, 1765, 1784, 1499, 2925, 4984, 648, 3773, 2711, 3945, 3780, 3995, 83, 801, 2684, 3604, 3271, 1101, 2264, 1487, 4541, 936, 4365, 3013, 2442, 3334, 2311, 4722, 2895, 4775, 3100, 3599, 1986, 2061, 3415, 4111, 2358, 3825, 535, 2539, 3738, 2736, 2478, 4646, 4452, 2601, 1790, 2429, 1246, 3849, 4072, 3480, 512, 2921, 4881, 3752, 2281, 1970, 382, 4368, 4855, 3001, 1010, 4492, 1433, 3806, 3238, 81, 1906, 2127, 389, 3581, 3940, 2620, 4565, 117, 4057, 2025, 4814, 364, 3794, 4545, 4399, 589, 2550, 2094, 802, 3665, 4540, 2409, 3774, 3679, 2073, 2212, 222, 970, 432, 3808, 4518, 4544, 3311, 3094, 2514, 2890, 3675, 3551, 1572, 4614, 4128, 2752, 1924, 3689, 4367, 77, 1345, 885, 974, 1241, 4272, 1068, 1096, 4233, 4819, 554, 4642, 780, 3863, 437, 1580, 935, 2649, 2481, 1869, 2380, 1169, 3809, 4900, 3931, 4437, 4213, 2657, 2146, 1191, 1896, 3984, 1252, 4094, 4670, 3036, 4948, 1220, 4137, 4721, 3939, 3983, 3432, 224)
Quantity <- c(2, 2, 3, 3, 20, 10, 3, 1, 6, 48, 1, 1, 1, 2, 6, 1, 3, 5, 1, 24, 1, 1, 4, 6, 12, 3, 1, 3, 6, 8, 2, 3, 3, 1, 1, 2, 2, 12, 4, 12, 16, 1, 1, 2, 2, 1, 12, 2, 6, 1, 31, 1, 1, 20, 3, 12, 3, 12, 1, 1, 6, 2, 12, 6, 4, 1, 2, 4, 12, 8, 6, 3, 1, 6, 6, 16, 1, 6, 1, 3, 24, 1, 3, 2, 1, 1, 3, 6, 1, 5, 48, 20, 36, 2, 9, 1, 12, 8, 5, 1, 1, 1, 12, 1, 12, 3, 2, 2, 6, 1, 1, 6, 24, 6, 1, 6, 3, 4, 10, 1, 3, 1, 1, 10, 2, 5, 6, 6, 1, 12, 12, 24, 3, 1, 2, 1, 10, 2, 3, 48, 6, 24, 12, 6, 1, 2, 1, 3, 1, 4, 2, 8, 6, 12, 2, 1, 1, 2, 24, 1, 6, 4, 1, 7, 8, 1, 3, 12, 2, 5, 36, 4, 1, 24, 4, 20, 2, 12, 3, 4, 1, 6, 2, 1, 1, 1, 24, 24, 2, 6, 3, 4, 12, 2, 12, 1, 12, 1, 2, 10, 2, 48, 2, 6, 6, 1, 48, 6, 1, 4, 48, 6, 24, 3, 1, 2, 1, 1, 8, 12, 16, 12, 3, 1, 12, 12, 6, 8, 24, 2, 2, 4, 2, 1, 2, 3, 1, 1, 2, 1, 24, 1, 2, 6, 4, 1, 3, 6, 12, 3)

online_retail <- tibble::tibble(Quantity)
glimpse(online_retail)
## Observations: 250
## Variables: 1
## $ Quantity <dbl> 2, 2, 3, 3, 20, 10, 3, 1, 6, 48, 1, 1, 1, 2, 6, 1, 3,...
# Summarize the Quantity variable
online_retail %>% 
    select(Quantity) %>%
    summary()
##     Quantity   
##  Min.   : 1.0  
##  1st Qu.: 1.0  
##  Median : 3.0  
##  Mean   : 6.9  
##  3rd Qu.: 8.0  
##  Max.   :48.0
# Create a histogram of the possible variable values
ggplot(online_retail, aes(x = Quantity)) + 
    geom_histogram(stat = "count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

# Create a sequence of numbers to capture the Quantity range
seq(1, 46, by=5)
##  [1]  1  6 11 16 21 26 31 36 41 46
# Use the cut function to create a variable quant_cat
online_retail <- online_retail %>%
    mutate(quant_cat = cut(Quantity, breaks = seq(1, 50, by = 5)))

# Create a table of the new column quant_cat
online_retail %>%
    select(quant_cat) %>%
    table()
## .
##   (1,6]  (6,11] (11,16] (16,21] (21,26] (26,31] (31,36] (36,41] (41,46] 
##     115      14      28       4      12       1       2       0       0
# Create new columns from the quant_cat feature
head(model.matrix(~ quant_cat - 1, data = online_retail))
##   quant_cat(1,6] quant_cat(6,11] quant_cat(11,16] quant_cat(16,21]
## 1              1               0                0                0
## 2              1               0                0                0
## 3              1               0                0                0
## 4              1               0                0                0
## 5              0               0                0                1
## 6              0               1                0                0
##   quant_cat(21,26] quant_cat(26,31] quant_cat(31,36] quant_cat(36,41]
## 1                0                0                0                0
## 2                0                0                0                0
## 3                0                0                0                0
## 4                0                0                0                0
## 5                0                0                0                0
## 6                0                0                0                0
##   quant_cat(41,46]
## 1                0
## 2                0
## 3                0
## 4                0
## 5                0
## 6                0
# Break the Quantity variable into 3 buckets
online_retail <- online_retail %>% 
    mutate(quant_q = ntile(Quantity, 3))

# Use table to look at the new variable
online_retail %>% 
    select(quant_q) %>% 
    table()
## .
##  1  2  3 
## 84 83 83
# Use table to look at the new variable
online_retail %>% 
    select(quant_q) %>% 
    table()
## .
##  1  2  3 
## 84 83 83
# Specify a full rank representation of the new column
head(model.matrix(~ quant_q, data = online_retail))
##   (Intercept) quant_q
## 1           1       1
## 2           1       1
## 3           1       2
## 4           1       2
## 5           1       3
## 6           1       3
# Look at the column timestamp
discipline_logs %>% 
    select(timestamp) %>% 
    glimpse()
## Observations: 250
## Variables: 1
## $ timestamp <dttm> 2016-09-15 11:15:53, 2016-12-30 15:14:56, 2017-03-0...
# Assign date format to the timestamp_date column
discipline_logs <- discipline_logs %>% 
    mutate(timestamp_date=lubridate::ymd_hms(timestamp))


# Create new column dow (day of the week) 
discipline_logs <- discipline_logs %>% 
    mutate(dow = lubridate::wday(timestamp_date, label = TRUE))

head(discipline_logs)
## # A tibble: 6 x 10
##      id grade discipline gender infraction timestamp            male female
##   <dbl> <dbl>      <dbl> <fct>  <fct>      <dttm>              <dbl>  <dbl>
## 1  3410     5          0 Female academic ~ 2016-09-15 11:15:53     0      1
## 2  9157    12          0 Female disruptiv~ 2016-12-30 15:14:56     0      1
## 3  2250     3          0 Female failure t~ 2017-03-09 11:48:08     0      1
## 4  2353     3          0 Female failure t~ 2017-05-04 12:13:14     0      1
## 5  4872     6          1 Male   alcohol    2017-05-04 14:51:45     1      0
## 6  2929     4          0 Male   failure t~ 2017-01-05 11:06:34     1      0
## # ... with 2 more variables: timestamp_date <dttm>, dow <ord>
# Create new column hod (hour of day) 
discipline_logs <- discipline_logs %>% 
    mutate(hod = lubridate::hour(timestamp_date))

head(discipline_logs)
## # A tibble: 6 x 11
##      id grade discipline gender infraction timestamp            male female
##   <dbl> <dbl>      <dbl> <fct>  <fct>      <dttm>              <dbl>  <dbl>
## 1  3410     5          0 Female academic ~ 2016-09-15 11:15:53     0      1
## 2  9157    12          0 Female disruptiv~ 2016-12-30 15:14:56     0      1
## 3  2250     3          0 Female failure t~ 2017-03-09 11:48:08     0      1
## 4  2353     3          0 Female failure t~ 2017-05-04 12:13:14     0      1
## 5  4872     6          1 Male   alcohol    2017-05-04 14:51:45     1      0
## 6  2929     4          0 Male   failure t~ 2017-01-05 11:06:34     1      0
## # ... with 3 more variables: timestamp_date <dttm>, dow <ord>, hod <int>
# Create histogram of hod 
discipline_logs %>% 
    ggplot(aes(x=hod)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Chapter 3 - Transforming Numerical Features

Box-Cox and Yeo-Johnson Transformations:

  • Transformations can be helpful to address variables on different scales, and skews
  • Power transformations can be used to convert a variable to something resembling normality
    • The Box-Cox transformation estimates the best power for bringing the variable closer to normality (the variable must always be greater than zero)
    • The Yeo-Johnson transformation can be used for variables that are zero or negative
    • processed_vars <- caret::preProcess(myVar, method=c("YeoJohnson"))
    • transformed <- predict(processed_vars, myDF)
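
Since Box-Cox with lambda = 0 reduces to a log transform, a quick base-R sketch shows how such a transform pulls in right skew (the moment-based skewness function here is ad hoc, not from caret):

```r
set.seed(42)
x <- rlnorm(1000)  # log-normal draws: strongly right-skewed, always > 0

# Simple sample-moment estimator of skewness
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)       # large and positive
skewness(log(x))  # near zero: log(x) is approximately normal
```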

Normalization Techniques:

  • Rescaling to a fixed range can be an effective technique
  • One common method of rescaling is (x - xmin) / (xmax - xmin), which maps values to [0, 1]
    • processed_vars <- caret::preProcess(myVars, method=c("range"))
    • transformed <- predict(processed_vars, myDF)
  • Can center variables using method=c("center") in caret::preProcess()
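
The min-max formula in base R, a hand-rolled equivalent of what preProcess with method="range" does per column (the function name is made up):

```r
# Rescale a numeric vector to the [0, 1] range
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
rescale01(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
```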

Z-score Standardization:

  • Z-score standardization rescales each variable to have mean 0 and standard deviation 1
    • (x - mean(x)) / sd(x)
    • processed_vars <- caret::preProcess(myVars, method=c("center", "scale"))
    • transformed <- predict(processed_vars, myDF)
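
The same standardization in base R; scale() with its defaults centers and scales, matching the (x - mean(x)) / sd(x) formula:

```r
x <- c(2, 4, 6, 8, 10)
z <- (x - mean(x)) / sd(x)

# base R scale() gives the same values (as a one-column matrix)
all.equal(z, as.numeric(scale(x)))  # TRUE
mean(z)  # 0
sd(z)    # 1
```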

Example code includes:

defense <- c(49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 50, 40, 40, 55, 75, 35, 60, 30, 65, 44, 69, 40, 55, 85, 110, 52, 67, 87, 40, 57, 77, 48, 73, 40, 75, 20, 45, 35, 70, 55, 70, 85, 55, 80, 50, 60, 25, 50, 35, 60, 48, 78, 35, 60, 45, 80, 40, 65, 95, 15, 30, 45, 50, 70, 80, 35, 50, 65, 35, 65, 100, 115, 130, 55, 70, 65, 110, 70, 95, 55, 45, 70, 55, 80, 50, 75, 100, 180, 30, 45, 60, 160, 45, 70, 90, 115, 50, 70, 80, 85, 95, 110, 53, 79, 75, 95, 120, 95, 120, 5, 115, 80, 70, 95, 60, 65, 55, 85, 65, 80, 35, 57, 57, 100, 95, 55, 79, 80, 48, 50, 60, 60, 60, 70, 100, 125, 90, 105, 65, 65, 100, 85, 90, 45, 65, 95, 90, 100, 65, 80, 100, 43, 58, 78, 64, 80, 100, 34, 64, 30, 50, 30, 50, 40, 70, 80, 38, 58, 15, 28, 15, 65, 85, 45, 70, 40, 55, 85, 95, 50, 80, 115, 75, 40, 50, 70, 55, 30, 55, 45, 45, 85, 60, 110, 42, 80, 60, 48, 58, 65, 90, 140, 70, 105, 200, 50, 75, 85, 100, 230, 75, 55, 50, 75, 40, 120, 40, 80, 95, 35, 75, 45, 70, 140, 30, 50, 95, 60, 120, 90, 62, 35, 35, 95, 15, 37, 37, 105, 10, 75, 85, 115, 50, 70, 110, 130, 90, 100, 35, 45, 65, 40, 60, 70, 50, 70, 90, 35, 70, 41, 61, 35, 55, 50, 55, 70, 30, 50, 70, 50, 40, 60, 30, 60, 30, 100, 25, 35, 65, 32, 62, 60, 80, 60, 80, 100, 90, 45, 45, 23, 43, 63, 30, 60, 40, 135, 45, 65, 75, 85, 100, 140, 180, 55, 75, 40, 60, 40, 50, 75, 75, 45, 53, 83, 20, 40, 35, 45, 40, 70, 140, 35, 65, 60, 45, 50, 80, 40, 60, 60, 90, 60, 60, 65, 85, 43, 73, 65, 85, 55, 105, 77, 97, 50, 100, 20, 79, 70, 70, 35, 65, 90, 130, 83, 80, 60, 48, 50, 80, 50, 70, 90, 85, 105, 105, 130, 55, 60, 100, 80, 80, 100, 130, 200, 100, 150, 90, 80, 90, 140, 90, 100, 50, 64, 85, 105, 44, 52, 71, 53, 68, 88, 30, 50, 70, 40, 60, 41, 51, 34, 49, 79, 35, 65, 40, 60, 118, 168, 45, 85, 50, 42, 102, 70, 35, 55, 45, 70, 48, 68, 66, 34, 44, 44, 84, 60, 52, 42, 64, 50, 47, 67, 86, 116, 95, 45, 5, 45, 108, 45, 65, 95, 40, 40, 70, 78, 118, 90, 110, 40, 65, 72, 56, 76, 50, 50, 75, 65, 115, 95, 130, 125, 67, 67, 95, 86, 130, 110, 125, 80, 70, 65, 145, 135, 70, 
77, 130, 105, 70, 120, 100, 106, 110, 120, 120, 80, 100, 90, 100, 120, 100, 55, 75, 95, 45, 55, 65, 45, 60, 85, 39, 69, 45, 65, 90, 37, 50, 48, 63, 48, 63, 48, 63, 45, 85, 50, 62, 80, 32, 63, 85, 105, 130, 43, 55, 40, 60, 86, 55, 85, 95, 40, 55, 75, 85, 75, 70, 90, 80, 59, 99, 89, 60, 85, 50, 75, 65, 35, 45, 80, 45, 55, 67, 85, 125, 70, 115, 80, 85, 145, 103, 133, 45, 65, 62, 82, 40, 60, 40, 60, 50, 70, 95, 40, 50, 75, 50, 63, 50, 65, 85, 50, 70, 60, 45, 105)
defense <- c(defense, 45, 70, 50, 70, 80, 50, 60, 91, 131, 70, 95, 115, 40, 70, 80, 55, 75, 55, 60, 90, 60, 70, 90, 40, 80, 50, 85, 40, 84, 50, 60, 90, 50, 80, 70, 100, 95, 50, 75, 75, 105, 66, 112, 50, 70, 90, 55, 65, 129, 90, 72, 70, 70, 100, 120, 90, 90, 90, 77, 95, 65, 95, 122, 40, 58, 72, 40, 52, 67, 38, 77, 43, 55, 71, 40, 60, 50, 58, 72, 39, 47, 68, 48, 62, 62, 78, 60, 54, 76, 100, 150, 150, 60, 72, 66, 86, 53, 88, 67, 115, 60, 90, 62, 88, 33, 52, 77, 119, 50, 72, 65, 75, 57, 150, 35, 53, 70, 91, 48, 76, 70, 122, 85, 184, 35, 80, 95, 95, 121, 150, 60, 120, 55, 75, 75, 40, 50, 90, 54, 69, 74, 30, 50, 75, 30, 60, 45, 95, 90, 57, 77, 70, 40, 60, 40, 65, 20, 62, 152, 70, 100, 52, 92, 35, 90, 55, 80, 40, 60, 50, 80, 38, 48, 98, 90, 80, 90, 40, 140, 80, 110, 130, 95, 95, 100, 65, 135, 63, 80, 70, 85, 100, 65, 90, 125, 85, 75, 115, 115, 31, 131, 107, 89, 47, 139, 37, 71, 103, 131, 53, 101, 115, 80, 123, 111, 78, 120, 40, 80, 35, 70, 70, 40, 40, 40, 40, 40, 40, 50, 90, 120, 40, 75, 30, 60, 35, 60, 65, 100, 115, 130, 180, 50, 75, 80, 85, 110, 100, 120, 109, 85, 100, 70, 105, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 230, 140, 115, 90, 150, 75, 80, 110, 65, 125, 125, 230, 85, 80, 70, 100, 110, 70, 70, 70, 75, 60, 80, 130, 150, 120, 100, 90, 160, 100, 20, 160, 90, 45, 45, 105, 95, 70, 48, 68, 94, 115, 88, 105, 95, 107, 107, 107, 107, 107, 100, 75, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 126, 65, 105, 50, 50, 50, 70, 70, 70, 80, 70, 90, 90, 100, 90, 90, 95, 95, 95, 95, 67, 67, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 39, 39, 39, 39, 47, 47, 47, 47, 67, 68, 68, 68, 68, 60, 60, 60, 60, 60, 60, 60, 60, 60, 76, 50, 70, 70, 70, 122, 122, 122, 95, 71, 71, 121, 121, 110, 60, 60, 90, 70, 70, 70, 75, 130, 90, 60, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 100, 100, 100, 100, 100, 100, 60, 60, 60, 60, 60, 60, 
60, 60, 60, 60, 125, 115)
speed <- c(45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 35, 75, 56, 71, 101, 72, 97, 70, 100, 55, 80, 90, 110, 40, 65, 41, 56, 76, 50, 65, 85, 35, 60, 65, 100, 20, 45, 55, 90, 30, 40, 50, 25, 30, 45, 90, 95, 120, 90, 115, 55, 85, 70, 95, 60, 95, 90, 90, 70, 90, 105, 120, 35, 45, 55, 40, 55, 70, 70, 100, 20, 35, 45, 90, 105, 15, 30, 45, 70, 60, 75, 100, 45, 70, 25, 50, 40, 70, 80, 95, 110, 70, 42, 67, 50, 75, 100, 150, 40, 55, 35, 45, 87, 76, 30, 35, 60, 25, 40, 50, 60, 90, 60, 85, 63, 68, 85, 115, 90, 105, 95, 105, 93, 85, 110, 80, 81, 60, 48, 55, 65, 130, 65, 40, 35, 55, 55, 80, 130, 30, 85, 100, 90, 50, 70, 80, 130, 100, 45, 60, 80, 65, 80, 100, 43, 58, 78, 20, 90, 50, 70, 55, 85, 30, 40, 130, 67, 67, 60, 15, 15, 20, 40, 70, 95, 35, 45, 55, 50, 40, 50, 30, 70, 50, 80, 110, 85, 30, 30, 95, 15, 35, 110, 65, 91, 30, 85, 48, 33, 85, 15, 40, 45, 85, 30, 30, 45, 85, 65, 5, 85, 115, 40, 55, 20, 30, 50, 50, 35, 65, 45, 75, 70, 70, 65, 95, 85, 40, 50, 60, 85, 75, 35, 70, 65, 95, 83, 100, 55, 115, 100, 85, 41, 51, 61, 110, 90, 100, 70, 95, 120, 45, 55, 80, 40, 50, 60, 35, 70, 60, 100, 20, 15, 65, 15, 65, 30, 50, 70, 30, 60, 80, 85, 125, 85, 65, 40, 50, 80, 65, 80, 35, 70, 30, 90, 100, 40, 160, 40, 28, 48, 68, 25, 50, 20, 30, 50, 90, 50, 50, 30, 40, 50, 60, 80, 65, 105, 95, 95, 85, 85, 65, 40, 55, 65, 95, 60, 60, 35, 40, 20, 60, 80, 60, 10, 70, 100, 35, 55, 50, 80, 90, 65, 70, 70, 60, 60, 35, 55, 55, 75, 23, 43, 75, 45, 80, 81, 70, 40, 45, 65, 25, 25, 51, 65, 75, 23, 50, 80, 25, 45, 65, 32, 52, 52, 55, 97, 50, 50, 100, 30, 50, 70, 50, 50, 50, 110, 110, 90, 90, 95, 100, 150, 31, 36, 56, 61, 81, 108, 40, 50, 60, 60, 80, 100, 31, 71, 25, 65, 45, 60, 70, 55, 90, 58, 58, 30, 30, 36, 36, 66, 70, 40, 95, 85, 115, 35, 85, 34, 39, 115, 70, 80, 85, 105, 105, 71, 85, 112, 45, 74, 84, 23, 33, 10, 60, 30, 91, 35, 42, 82, 102, 5, 60, 90, 32, 47, 65, 95, 50, 85, 46, 66, 91, 50, 40, 60, 125, 60, 50, 40, 50, 95, 83, 80, 95, 95, 65, 95, 80, 90, 80, 40, 45, 110, 91, 95, 80, 115, 90, 
100, 77, 100, 90, 85, 80, 100, 125, 100, 120, 100, 63, 83, 113, 45, 55, 65, 45, 60, 70, 42, 77, 55, 60, 80, 66, 106, 64, 101, 64, 101, 64, 101, 24, 29, 43, 65, 93, 76, 116, 15, 20, 25, 72, 114, 68, 88, 50, 35, 40, 45, 64, 69, 74, 45, 85, 42, 42, 92, 57, 47, 112, 66, 116, 30, 90, 98, 65, 74, 92, 50, 95, 60, 55, 45, 48, 58, 97, 30, 30, 22, 32, 70, 110, 65, 75, 65, 105, 75, 115, 45, 55, 65, 20, 30, 30, 55, 98, 44, 59, 79, 75, 95, 103, 60, 20, 15, 30, 40, 60, 65)
speed <- c(speed, 65, 108, 10, 20, 30, 50, 90, 60, 40, 50, 30, 40, 20, 55, 80, 57, 67, 97, 40, 50, 105, 25, 145, 32, 65, 105, 48, 35, 55, 60, 70, 55, 60, 80, 60, 80, 65, 109, 38, 58, 98, 60, 100, 108, 108, 108, 111, 111, 90, 90, 101, 95, 108, 90, 99, 38, 57, 64, 60, 73, 104, 71, 97, 122, 57, 78, 62, 84, 126, 35, 29, 89, 72, 106, 42, 52, 75, 52, 68, 43, 58, 102, 68, 104, 28, 35, 60, 23, 29, 49, 72, 45, 73, 50, 68, 30, 44, 44, 59, 70, 109, 48, 71, 46, 58, 60, 118, 101, 50, 40, 60, 80, 75, 38, 56, 51, 84, 28, 28, 55, 123, 99, 99, 95, 50, 70, 70, 42, 52, 70, 70, 90, 60, 40, 50, 60, 65, 75, 60, 45, 45, 46, 36, 43, 63, 43, 93, 84, 124, 60, 112, 40, 45, 35, 45, 35, 27, 42, 35, 45, 15, 30, 77, 117, 50, 60, 32, 62, 72, 100, 60, 80, 80, 40, 15, 35, 5, 59, 95, 60, 65, 36, 96, 96, 92, 36, 40, 45, 65, 85, 130, 95, 75, 85, 37, 37, 97, 97, 103, 79, 151, 83, 61, 109, 43, 79, 65, 125, 80, 100, 100, 78, 145, 121, 72, 77, 77, 90, 90, 90, 90, 90, 90, 110, 40, 65, 65, 109, 90, 110, 90, 115, 150, 20, 35, 45, 30, 25, 50, 130, 45, 45, 100, 105, 81, 150, 130, 140, 45, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 30, 75, 75, 115, 71, 145, 100, 70, 100, 20, 50, 50, 100, 135, 105, 20, 80, 70, 70, 70, 75, 115, 100, 120, 110, 110, 110, 90, 90, 115, 150, 90, 180, 36, 36, 36, 36, 85, 34, 39, 135, 92, 112, 30, 110, 86, 86, 86, 86, 86, 90, 127, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 50, 98, 55, 75, 75, 75, 95, 95, 95, 121, 101, 91, 95, 95, 108, 128, 99, 99, 99, 99, 122, 132, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 42, 42, 42, 42, 52, 52, 52, 52, 92, 75, 75, 75, 75, 102, 102, 102, 102, 102, 102, 102, 102, 102, 104, 60, 56, 46, 41, 99, 69, 54, 99, 115, 115, 95, 85, 110, 80, 45, 43, 93, 93, 93, 82, 30, 45, 117, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 60, 60, 60, 60, 60, 60, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 85, 65)
attack <- c(49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25, 90, 45, 60, 80, 56, 81, 60, 90, 60, 95, 55, 90, 75, 100, 47, 62, 92, 57, 72, 102, 45, 70, 41, 76, 45, 70, 45, 80, 50, 65, 80, 70, 95, 55, 65, 55, 100, 45, 70, 52, 82, 80, 105, 70, 110, 50, 65, 95, 20, 35, 50, 80, 100, 130, 75, 90, 105, 40, 70, 80, 95, 120, 85, 100, 65, 75, 35, 60, 90, 85, 110, 45, 70, 80, 105, 65, 95, 35, 50, 65, 45, 48, 73, 105, 130, 30, 50, 40, 95, 50, 80, 120, 105, 55, 65, 90, 85, 130, 5, 55, 95, 40, 65, 67, 92, 45, 75, 45, 110, 50, 83, 95, 125, 100, 10, 125, 85, 48, 55, 65, 65, 130, 60, 40, 60, 80, 115, 105, 110, 85, 90, 100, 64, 84, 134, 110, 100, 49, 62, 82, 52, 64, 84, 65, 80, 105, 46, 76, 30, 50, 20, 35, 60, 90, 90, 38, 58, 40, 25, 30, 20, 40, 50, 75, 40, 55, 75, 80, 20, 50, 100, 75, 35, 45, 55, 70, 30, 75, 65, 45, 85, 65, 65, 85, 75, 60, 72, 33, 80, 65, 90, 70, 75, 85, 80, 120, 95, 130, 10, 125, 95, 80, 130, 40, 50, 50, 100, 55, 65, 105, 55, 40, 80, 60, 90, 95, 60, 120, 80, 95, 20, 35, 95, 30, 63, 75, 80, 10, 85, 115, 75, 64, 84, 134, 90, 130, 100, 45, 65, 85, 60, 85, 120, 70, 85, 110, 55, 90, 30, 70, 45, 35, 70, 35, 50, 30, 50, 70, 40, 70, 100, 55, 85, 30, 50, 25, 35, 65, 30, 60, 40, 130, 60, 80, 160, 45, 90, 90, 51, 71, 91, 60, 120, 20, 45, 45, 65, 75, 85, 70, 90, 110, 40, 60, 45, 75, 50, 40, 73, 47, 60, 43, 73, 90, 120, 70, 90, 60, 100, 85, 25, 45, 60, 100, 70, 100, 85, 115, 40, 70, 115, 100, 55, 95, 48, 78, 80, 120, 40, 70, 41, 81, 95, 125, 15, 60, 70, 90, 75, 115, 40, 70, 68, 50, 130, 23, 50, 80, 40, 60, 80, 64, 104, 84, 90, 30, 75, 95, 135, 55, 75, 135, 100, 50, 75, 80, 90, 100, 150, 150, 100, 150, 68, 89, 109, 58, 78, 104, 51, 66, 86, 55, 75, 120, 45, 85, 25, 85, 65, 85, 120, 30, 70, 125, 165, 42, 52, 29, 59, 94, 30, 80, 45, 65, 105, 35, 60, 48, 83, 100, 50, 80, 66, 76, 60, 125, 55, 82, 30, 63, 93, 24, 89, 80, 25, 5, 65, 92, 70, 90, 130, 85, 70, 110, 72, 112, 50, 90, 61, 106, 100, 49, 69, 20, 62, 92, 120, 70, 85, 140, 100, 123, 95, 50, 76, 110, 60, 95, 130, 80, 125, 
55, 100, 80, 50, 75, 105, 125, 120, 120, 90, 160, 100, 70, 80, 100, 90, 100, 120, 100, 45, 60, 75, 63, 93, 123, 55, 75, 100, 55, 85, 60, 80, 110, 50, 88, 53, 98, 53, 98, 53, 98, 25, 55, 55, 77, 115, 60, 100, 75, 105, 135, 45, 57, 85, 135, 60, 80, 105, 140, 50, 65, 95, 100, 125, 53, 63, 103, 45, 55, 100, 27, 67, 35, 60, 92, 72, 82, 117, 90, 140, 86, 65, 105, 75, 90, 58, 30, 50, 78, 108, 112, 140, 50, 95, 65, 105, 50, 95, 30, 45, 55, 30, 40, 65, 44, 87, 50, 65, 95, 60, 100, 75, 75, 135, 55, 85, 40, 60, 75, 47, 77, 50, 94, 55, 80, 100, 55, 85, 115, 55, 75, 30, 40, 55, 87, 117)
attack <- c(attack, 147, 70, 130, 50, 40, 70, 66, 85, 125, 120, 74, 124, 85, 125, 110, 83, 123, 55, 65, 97, 109, 65, 85, 105, 85, 60, 90, 129, 90, 115, 115, 120, 150, 125, 130, 72, 77, 120, 61, 78, 107, 45, 59, 69, 56, 63, 95, 36, 56, 50, 73, 81, 35, 22, 52, 50, 68, 38, 45, 65, 65, 100, 82, 124, 80, 48, 48, 80, 110, 50, 52, 72, 48, 80, 54, 92, 52, 105, 60, 75, 53, 73, 38, 55, 89, 121, 59, 77, 65, 92, 58, 50, 50, 75, 100, 80, 70, 110, 66, 90, 69, 117, 30, 70, 131, 131, 100, 100, 110, 110, 55, 75, 107, 65, 85, 115, 54, 69, 74, 75, 85, 120, 70, 110, 62, 82, 70, 82, 132, 70, 45, 55, 65, 115, 20, 53, 63, 100, 125, 40, 70, 55, 105, 35, 45, 44, 64, 75, 125, 30, 40, 120, 52, 60, 120, 35, 125, 55, 75, 60, 95, 95, 60, 115, 78, 98, 90, 105, 60, 131, 55, 75, 110, 115, 85, 130, 75, 29, 29, 137, 113, 53, 139, 137, 89, 101, 181, 101, 107, 95, 125, 100, 130, 104, 103, 150, 80, 56, 71, 71, 55, 55, 55, 55, 55, 55, 85, 75, 100, 41, 67, 55, 100, 35, 60, 50, 80, 95, 120, 75, 80, 105, 65, 105, 80, 125, 155, 155, 135, 190, 150, 95, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 125, 150, 185, 90, 164, 110, 160, 150, 85, 85, 105, 140, 100, 75, 140, 120, 110, 70, 70, 70, 165, 150, 120, 145, 145, 100, 130, 150, 180, 180, 180, 70, 95, 29, 29, 79, 69, 60, 48, 83, 136, 170, 145, 132, 165, 65, 65, 65, 65, 65, 120, 103, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 60, 92, 30, 60, 60, 60, 100, 100, 100, 100, 105, 145, 120, 170, 72, 128, 120, 120, 120, 120, 95, 145, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 38, 38, 38, 38, 45, 45, 45, 45, 65, 65, 65, 65, 65, 80, 80, 80, 80, 80, 80, 80, 80, 80, 48, 150, 66, 66, 66, 85, 95, 100, 131, 100, 100, 100, 100, 160, 160, 110, 70, 70, 70, 70, 115, 140, 105, 64, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 60, 60, 60, 60, 60, 60, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 95)
hp <- c(45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45, 65, 40, 63, 83, 30, 55, 40, 65, 35, 60, 35, 60, 50, 75, 55, 70, 90, 46, 61, 81, 70, 95, 38, 73, 115, 140, 40, 75, 45, 60, 75, 35, 60, 60, 70, 10, 35, 40, 65, 50, 80, 40, 65, 55, 90, 40, 65, 90, 25, 40, 55, 70, 80, 90, 50, 65, 80, 40, 80, 40, 55, 80, 50, 65, 90, 95, 25, 50, 52, 35, 60, 65, 90, 80, 105, 30, 50, 30, 45, 60, 35, 60, 85, 30, 55, 40, 60, 60, 95, 50, 60, 50, 50, 90, 40, 65, 80, 105, 250, 65, 105, 30, 55, 45, 80, 30, 60, 40, 70, 65, 65, 65, 65, 75, 20, 95, 130, 48, 55, 130, 65, 65, 65, 35, 70, 30, 60, 80, 160, 90, 90, 90, 41, 61, 91, 106, 100, 45, 60, 80, 39, 58, 78, 50, 65, 85, 35, 85, 60, 100, 40, 55, 40, 70, 85, 75, 125, 20, 50, 90, 35, 55, 40, 65, 55, 70, 90, 75, 70, 100, 70, 90, 35, 55, 75, 55, 30, 75, 65, 55, 95, 65, 95, 60, 95, 60, 48, 190, 70, 50, 75, 100, 65, 75, 60, 90, 65, 70, 20, 80, 55, 60, 90, 40, 60, 50, 100, 65, 35, 75, 45, 85, 65, 45, 75, 75, 90, 90, 85, 73, 55, 35, 50, 45, 45, 45, 95, 255, 90, 115, 100, 50, 70, 100, 106, 106, 100, 40, 50, 70, 45, 60, 80, 50, 70, 100, 35, 70, 38, 78, 45, 50, 60, 50, 60, 40, 60, 80, 40, 70, 90, 40, 60, 40, 60, 28, 38, 68, 40, 70, 60, 60, 60, 80, 150, 31, 61, 1, 64, 84, 104, 72, 144, 50, 30, 50, 70, 50, 50, 50, 60, 70, 30, 60, 40, 70, 60, 60, 65, 65, 50, 70, 100, 45, 70, 130, 170, 60, 70, 70, 60, 80, 60, 45, 50, 80, 50, 70, 45, 75, 73, 73, 90, 90, 50, 110, 43, 63, 40, 60, 66, 86, 45, 75, 20, 95, 70, 60, 44, 64, 20, 40, 99, 75, 65, 95, 50, 80, 70, 90, 110, 35, 55, 55, 100, 43, 45, 65, 95, 40, 60, 80, 80, 80, 80, 80, 80, 100, 100, 105, 100, 50, 55, 75, 95, 44, 64, 76, 53, 64, 84, 40, 55, 85, 59, 79, 37, 77, 45, 60, 80, 40, 60, 67, 97, 30, 60, 40, 60, 70, 30, 70, 60, 55, 85, 45, 70, 76, 111, 75, 90, 150, 55, 65, 60, 100, 49, 71, 45, 63, 103, 57, 67, 50, 20, 100, 76, 50, 58, 68, 108, 135, 40, 70, 68, 108, 40, 70, 48, 83, 74, 49, 69, 45, 60, 90, 70, 70, 110, 115, 100, 75, 75, 85, 86, 65, 65, 75, 110, 85, 68, 60, 45, 70, 50, 75, 80, 75, 100, 90, 91, 
110, 150, 120, 80, 100, 70, 100, 120, 100, 45, 60, 75, 65, 90, 110, 55, 75, 95, 45, 60, 45, 65, 85, 41, 64, 50, 75, 50, 75, 50, 75, 76, 116, 50, 62, 80, 45, 75, 55, 70, 85, 65, 67, 60, 110, 103, 75, 85, 105, 50, 75, 105, 120, 75, 45, 55, 75, 30, 40, 60, 40, 60, 45, 70, 70, 50, 60, 95, 70, 105, 75, 50, 70, 50, 65, 72, 38, 58, 54, 74, 55, 75, 50, 80, 40, 60, 55, 75, 45, 60)
hp <- c(hp, 70, 45, 65, 110, 62, 75, 36, 51, 71, 60, 80, 55, 50, 70, 69, 114, 55, 100, 165, 50, 70, 44, 74, 40, 60, 60, 35, 65, 85, 55, 75, 50, 60, 60, 46, 66, 76, 55, 95, 80, 50, 80, 109, 45, 65, 77, 59, 89, 45, 65, 95, 70, 100, 70, 110, 85, 58, 52, 72, 92, 55, 85, 91, 91, 91, 79, 79, 100, 100, 89, 125, 91, 100, 71, 56, 61, 88, 40, 59, 75, 41, 54, 72, 38, 85, 45, 62, 78, 38, 45, 80, 62, 86, 44, 54, 78, 66, 123, 67, 95, 75, 62, 74, 45, 59, 60, 78, 101, 62, 82, 53, 86, 42, 72, 50, 65, 50, 71, 44, 62, 58, 82, 77, 123, 95, 78, 67, 50, 45, 68, 90, 57, 43, 85, 49, 65, 55, 95, 40, 85, 126, 126, 108, 50, 80, 80, 68, 78, 78, 45, 65, 95, 50, 60, 80, 35, 55, 80, 48, 88, 47, 57, 77, 47, 97, 75, 40, 60, 45, 75, 45, 50, 50, 70, 100, 38, 68, 40, 70, 40, 60, 48, 68, 70, 120, 42, 52, 72, 51, 90, 100, 25, 75, 55, 85, 55, 95, 95, 60, 65, 60, 65, 55, 68, 78, 70, 45, 55, 75, 70, 70, 70, 70, 43, 43, 137, 137, 109, 107, 71, 83, 97, 59, 223, 97, 80, 90, 80, 78, 78, 79, 65, 83, 30, 75, 75, 35, 35, 35, 35, 35, 35, 60, 50, 75, 38, 73, 10, 35, 40, 65, 55, 40, 55, 80, 95, 80, 105, 60, 95, 60, 105, 65, 95, 80, 106, 106, 90, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 75, 70, 80, 75, 100, 70, 80, 100, 68, 50, 50, 70, 60, 70, 70, 70, 75, 70, 70, 70, 64, 65, 80, 95, 80, 80, 80, 100, 100, 105, 50, 50, 50, 40, 40, 60, 60, 70, 76, 111, 65, 108, 70, 90, 68, 50, 50, 50, 50, 50, 150, 100, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 103, 70, 105, 60, 60, 60, 80, 80, 80, 79, 79, 89, 125, 125, 91, 100, 71, 71, 71, 71, 72, 72, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 44, 44, 44, 44, 54, 54, 54, 54, 74, 78, 78, 78, 78, 75, 75, 75, 75, 75, 75, 75, 75, 75, 74, 60, 44, 54, 59, 55, 75, 85, 126, 54, 54, 108, 216, 50, 80, 88, 77, 75, 75, 75, 85, 45, 70, 68, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 
60, 60, 60, 75, 80)
spatk <- c(65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, 25, 45, 35, 50, 70, 25, 50, 31, 61, 40, 65, 50, 90, 20, 45, 40, 55, 75, 40, 55, 85, 60, 95, 50, 81, 45, 85, 30, 65, 75, 85, 110, 45, 60, 40, 90, 35, 50, 40, 65, 65, 95, 35, 60, 70, 100, 40, 50, 70, 105, 120, 135, 35, 50, 65, 70, 85, 100, 50, 80, 30, 45, 55, 65, 80, 40, 100, 95, 120, 58, 35, 60, 45, 70, 40, 65, 45, 85, 100, 115, 130, 30, 43, 73, 25, 50, 55, 80, 60, 125, 40, 50, 35, 35, 60, 60, 85, 30, 45, 35, 100, 40, 70, 95, 35, 65, 70, 100, 100, 55, 115, 95, 100, 55, 40, 15, 60, 85, 48, 45, 110, 110, 95, 85, 90, 115, 55, 65, 60, 65, 95, 125, 125, 50, 70, 100, 154, 100, 49, 63, 83, 60, 80, 109, 44, 59, 79, 35, 45, 36, 86, 40, 55, 40, 60, 70, 56, 76, 35, 45, 40, 40, 80, 70, 95, 65, 80, 115, 90, 20, 60, 30, 90, 35, 45, 55, 40, 30, 105, 75, 25, 65, 130, 60, 85, 100, 85, 72, 33, 90, 35, 60, 65, 35, 55, 40, 60, 55, 55, 10, 40, 35, 50, 75, 70, 90, 30, 60, 65, 65, 105, 65, 80, 40, 80, 110, 95, 40, 60, 105, 85, 20, 35, 35, 85, 65, 70, 40, 75, 115, 90, 90, 45, 65, 95, 90, 110, 100, 65, 85, 105, 70, 85, 110, 50, 60, 85, 30, 60, 30, 50, 20, 25, 100, 25, 50, 40, 60, 90, 30, 60, 90, 30, 75, 55, 95, 45, 65, 125, 50, 100, 40, 60, 35, 55, 95, 30, 50, 30, 51, 71, 91, 20, 40, 20, 45, 35, 55, 65, 55, 40, 50, 60, 40, 60, 65, 105, 85, 75, 47, 73, 100, 43, 73, 65, 95, 70, 90, 65, 105, 85, 70, 90, 60, 45, 50, 80, 85, 115, 40, 70, 60, 100, 95, 55, 46, 76, 50, 90, 40, 70, 61, 81, 40, 70, 10, 100, 70, 60, 63, 83, 30, 60, 72, 95, 75, 23, 50, 80, 55, 75, 95, 74, 94, 114, 45, 40, 40, 60, 110, 35, 55, 95, 50, 100, 75, 110, 130, 150, 100, 150, 100, 150, 45, 55, 75, 58, 78, 104, 61, 81, 111, 30, 40, 50, 35, 55, 25, 55, 40, 60, 95, 50, 125, 30, 65, 42, 47, 29, 79, 94, 30, 80, 45, 60, 85, 62, 87, 57, 92, 60, 60, 90, 44, 54, 105, 105, 42, 64, 65, 41, 71, 24, 79, 10, 70, 15, 92, 92, 40, 50, 80, 40, 35, 115, 38, 68, 30, 60, 61, 86, 90, 49, 69, 60, 62, 92, 45, 130, 80, 55, 110, 95, 125, 120, 116, 60, 130, 45, 70, 135, 65, 75, 65, 80, 95, 
75, 105, 125, 150, 150, 130, 80, 100, 75, 80, 100, 135, 100, 120, 100, 45, 60, 75, 45, 70, 100, 63, 83, 108, 35, 60, 25, 35, 45, 50, 88, 53, 98, 53, 98, 53, 98, 67, 107, 36, 50, 65, 50, 80, 25, 50, 60, 55, 77, 30, 50, 60, 25, 40, 55, 50, 65, 85, 30, 30, 40, 50, 70, 30, 40, 55, 37, 77, 70, 110, 80, 35, 45, 65, 15, 30, 106, 35, 65, 35, 45, 103, 55, 95, 53, 83, 74, 112, 40, 60, 80, 120, 40, 65, 55, 75, 95, 105, 125, 125, 44, 87, 65, 80, 110, 40, 60, 75, 40, 60, 55, 85, 65, 85, 40, 57, 97, 24, 54, 45, 70, 70, 45, 75, 105, 85, 125, 65, 95, 145, 30, 40, 60, 60, 70, 95, 40)
spatk <- c(spatk, 100, 81, 55, 95, 60, 35, 55, 40, 60, 40, 37, 57, 45, 55, 105, 48, 45, 65, 125, 50, 135, 90, 72, 90, 125, 125, 150, 120, 115, 130, 129, 128, 120, 48, 56, 74, 62, 90, 114, 62, 83, 103, 32, 50, 40, 56, 74, 27, 27, 90, 73, 109, 61, 75, 112, 62, 97, 46, 69, 65, 63, 83, 35, 45, 50, 63, 99, 59, 85, 37, 68, 39, 54, 60, 97, 58, 120, 61, 109, 45, 69, 67, 99, 110, 74, 81, 50, 55, 83, 110, 80, 50, 65, 44, 58, 32, 44, 45, 97, 131, 131, 81, 100, 150, 130, 50, 70, 100, 60, 80, 80, 66, 91, 126, 30, 40, 75, 30, 55, 55, 55, 145, 42, 62, 98, 55, 95, 30, 55, 25, 43, 53, 45, 55, 40, 50, 50, 80, 65, 90, 71, 111, 45, 55, 30, 40, 50, 82, 90, 40, 20, 60, 70, 100, 30, 95, 95, 60, 75, 91, 40, 50, 70, 135, 86, 45, 65, 100, 95, 130, 85, 95, 29, 29, 113, 137, 127, 53, 137, 173, 107, 59, 97, 127, 130, 90, 122, 130, 159, 135, 15, 135, 25, 40, 40, 50, 50, 50, 50, 50, 50, 95, 10, 25, 50, 81, 35, 50, 50, 75, 175, 30, 45, 55, 130, 40, 65, 170, 125, 50, 60, 65, 70, 70, 154, 194, 165, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 55, 65, 40, 140, 95, 145, 130, 95, 165, 85, 55, 60, 80, 135, 110, 145, 110, 70, 70, 70, 93, 115, 120, 120, 105, 140, 160, 180, 150, 180, 180, 70, 95, 29, 29, 59, 69, 87, 57, 92, 54, 120, 140, 132, 65, 105, 105, 105, 105, 105, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 80, 80, 140, 40, 40, 40, 60, 60, 60, 110, 145, 105, 170, 120, 129, 77, 120, 120, 120, 120, 103, 153, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 61, 61, 61, 61, 75, 75, 75, 75, 125, 112, 112, 112, 112, 65, 65, 65, 65, 65, 65, 65, 65, 65, 83, 150, 44, 44, 44, 58, 58, 58, 131, 61, 61, 81, 91, 160, 170, 55, 145, 98, 98, 98, 55, 140, 80, 111, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 60, 60, 60, 60, 60, 60, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 130)
spdef <- c(65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, 25, 80, 35, 50, 70, 35, 70, 31, 61, 54, 79, 50, 80, 30, 55, 40, 55, 85, 40, 55, 75, 65, 90, 65, 100, 25, 50, 40, 75, 65, 75, 90, 55, 80, 55, 75, 45, 70, 40, 65, 50, 80, 45, 70, 50, 80, 40, 50, 90, 55, 70, 95, 35, 60, 85, 30, 45, 70, 100, 120, 30, 45, 65, 65, 80, 40, 80, 55, 70, 62, 35, 60, 70, 95, 50, 100, 25, 45, 35, 55, 75, 45, 90, 115, 25, 50, 55, 80, 45, 75, 50, 80, 110, 110, 75, 45, 70, 30, 45, 105, 40, 80, 25, 45, 50, 80, 55, 85, 120, 80, 95, 85, 85, 70, 70, 20, 100, 95, 48, 65, 95, 95, 110, 75, 55, 70, 45, 70, 75, 110, 125, 90, 85, 50, 70, 100, 90, 100, 65, 80, 100, 50, 65, 85, 48, 63, 83, 45, 55, 56, 96, 80, 110, 40, 70, 80, 56, 76, 35, 55, 20, 65, 105, 45, 70, 45, 60, 90, 100, 50, 80, 65, 100, 55, 65, 95, 55, 30, 85, 45, 25, 65, 95, 130, 42, 110, 85, 48, 58, 65, 35, 60, 65, 65, 65, 40, 60, 55, 80, 230, 95, 75, 50, 75, 40, 80, 30, 60, 95, 35, 75, 45, 140, 70, 50, 80, 95, 40, 60, 95, 65, 45, 35, 110, 65, 55, 55, 70, 135, 100, 75, 115, 50, 70, 100, 154, 154, 100, 55, 65, 85, 50, 60, 70, 50, 70, 90, 30, 60, 41, 61, 30, 25, 50, 25, 90, 50, 70, 100, 30, 40, 60, 30, 50, 30, 70, 35, 55, 115, 52, 82, 60, 60, 35, 55, 65, 30, 50, 30, 23, 43, 73, 30, 60, 40, 90, 35, 55, 65, 55, 40, 50, 60, 55, 75, 40, 60, 75, 85, 85, 85, 80, 53, 83, 20, 40, 35, 45, 45, 75, 70, 80, 110, 60, 45, 50, 80, 40, 60, 75, 105, 60, 60, 85, 65, 41, 71, 35, 55, 70, 120, 87, 107, 50, 80, 55, 125, 70, 120, 33, 63, 90, 130, 87, 90, 60, 48, 50, 80, 50, 70, 90, 55, 75, 75, 65, 65, 30, 50, 80, 60, 80, 90, 100, 200, 150, 130, 110, 140, 90, 90, 100, 50, 55, 65, 85, 44, 52, 71, 56, 76, 101, 30, 40, 60, 40, 60, 41, 51, 34, 49, 79, 70, 105, 30, 50, 88, 138, 45, 105, 50, 42, 102, 90, 30, 50, 53, 78, 62, 82, 66, 44, 54, 56, 96, 105, 52, 37, 59, 50, 41, 61, 86, 116, 45, 90, 65, 42, 108, 45, 55, 85, 85, 40, 70, 42, 72, 55, 75, 40, 65, 72, 61, 86, 120, 60, 85, 85, 90, 95, 55, 50, 85, 95, 115, 56, 65, 95, 75, 60, 75, 115, 150, 135, 70, 77, 130, 105, 
70, 100, 120, 106, 110, 120, 130, 80, 100, 90, 100, 120, 100, 55, 75, 95, 45, 55, 65, 45, 60, 70, 39, 69, 45, 65, 90, 37, 50, 48, 63, 48, 63, 48, 63, 55, 95, 30, 42, 55, 32, 63, 25, 40, 80, 43, 55, 45, 65, 86, 35, 50, 65, 40, 55, 75, 85, 75, 60, 80, 80, 39, 79, 69, 50, 75, 50, 75, 55, 35, 45, 70, 45, 55, 67, 35, 75, 70, 115, 80, 65, 105, 45, 65, 45, 65, 62, 82, 40, 60, 40, 60, 65, 85, 110, 50, 60, 85, 50, 63, 60, 75, 95, 50, 70, 60, 45, 105, 55, 80, 85, 105, 45, 50, 60, 86, 116, 60, 85, 85, 40, 70, 80, 55, 95, 55, 60, 90, 40, 50, 70, 40, 80, 135, 65, 60, 99, 50, 60)
spdef <- c(spdef, 90, 50, 80, 40, 70, 95, 50, 75, 65, 95, 66, 48, 50, 70, 90, 55, 105, 72, 90, 129, 80, 80, 120, 100, 80, 90, 90, 128, 95, 45, 58, 75, 60, 70, 100, 44, 56, 71, 36, 77, 38, 52, 69, 25, 30, 50, 54, 66, 79, 98, 154, 57, 81, 48, 71, 90, 60, 81, 37, 49, 150, 65, 89, 57, 75, 46, 75, 56, 86, 60, 123, 63, 89, 43, 94, 45, 59, 63, 92, 130, 63, 67, 150, 75, 113, 150, 87, 60, 82, 55, 75, 35, 46, 40, 80, 98, 98, 95, 150, 130, 90, 50, 70, 100, 40, 50, 90, 56, 81, 116, 30, 50, 75, 30, 60, 45, 75, 75, 47, 67, 70, 40, 70, 40, 65, 25, 52, 142, 55, 85, 72, 132, 35, 90, 75, 100, 40, 60, 50, 60, 38, 48, 98, 110, 110, 60, 30, 90, 45, 75, 130, 95, 95, 100, 95, 85, 73, 105, 70, 91, 90, 45, 70, 105, 75, 115, 95, 130, 31, 131, 89, 107, 131, 53, 37, 71, 101, 31, 53, 89, 115, 90, 120, 85, 115, 115, 80, 80, 35, 80, 80, 50, 50, 50, 50, 50, 50, 85, 35, 65, 65, 100, 45, 70, 40, 65, 95, 30, 45, 65, 80, 50, 100, 95, 75, 80, 100, 90, 130, 95, 100, 120, 110, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 95, 100, 105, 90, 120, 85, 80, 110, 135, 115, 95, 80, 85, 80, 65, 105, 105, 70, 70, 70, 83, 60, 80, 90, 110, 150, 120, 160, 90, 100, 20, 160, 90, 45, 45, 85, 95, 78, 62, 82, 96, 95, 70, 105, 115, 107, 107, 107, 107, 107, 100, 75, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 126, 55, 105, 50, 50, 50, 70, 70, 70, 90, 80, 80, 100, 90, 90, 77, 95, 95, 95, 95, 71, 71, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 79, 79, 79, 79, 98, 98, 98, 98, 128, 154, 154, 154, 154, 90, 90, 90, 90, 90, 90, 90, 90, 90, 81, 50, 55, 55, 55, 75, 75, 75, 98, 85, 85, 95, 95, 110, 130, 60, 75, 70, 70, 70, 75, 135, 90, 60, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 100, 100, 100, 100, 100, 100, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 105, 115)

poke_df <- data.frame(hp, attack, defense, spatk, spdef, speed)
glimpse(poke_df)
## Observations: 1,061
## Variables: 6
## $ hp      <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45...
## $ attack  <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25...
## $ defense <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 5...
## $ spatk   <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, ...
## $ spdef   <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, ...
## $ speed   <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 3...
library(caret)

# Select the variables
poke_vars <- poke_df %>%
    select(defense, speed) 

# Perform a Box-Cox transformation
processed_vars <- preProcess(poke_vars, method = c("BoxCox"))

# Use predict to transform data
poke_df <- predict(processed_vars, poke_df)

# Plot transformed features
ggplot(poke_df, aes(x=defense)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(poke_df, aes(x=speed)) + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

duration <- c(261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517, 71, 174, 353, 98, 38, 219, 54, 262, 164, 160, 342, 181, 172, 296, 127, 255, 348, 225, 230, 208, 226, 336, 242, 365, 1666, 577, 137, 160, 180, 22, 1492, 616, 242, 355, 225, 160, 363, 266, 253, 179, 787, 145, 174, 104, 13, 185, 1778, 138, 812, 164, 391, 357, 91, 528, 273, 158, 177, 258, 172, 154, 291, 181, 176, 211, 349, 272, 208, 193, 212, 20, 1042, 246, 529, 1467, 1389, 188, 180, 48, 213, 583, 221, 173, 426, 287, 101, 203, 197, 257, 124, 229, 55, 400, 197, 190, 21, 514, 849, 194, 144, 212, 286, 107, 247, 518, 364, 178, 98, 439, 79, 120, 127, 175, 262, 61, 78, 143, 579, 677, 345, 185, 100, 125, 193, 136, 73, 528, 541, 163, 301, 46, 204, 98, 71, 157, 243, 186, 579, 163, 610, 2033, 85, 114, 114, 57, 238, 93, 128, 107, 181, 303, 558, 270, 228, 99, 240, 673, 233, 1056, 250, 252, 138, 130, 412, 179, 19, 458, 717, 313, 683, 1077, 416, 146, 167, 315, 140, 346, 562, 172, 217, 142, 67, 291, 309, 248, 98, 256, 82, 577, 286, 477, 611, 471, 381, 42, 251, 408, 215, 287, 216, 366, 210, 288, 168, 338, 410, 177, 127, 357, 175, 300, 136, 1419, 125, 213, 27, 238, 124, 18, 730, 746, 121, 247, 40, 181, 79, 206, 389, 127, 702, 151, 117, 232, 408, 179, 39, 282, 714)
balance <- c(2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71, 162, 229, 13, 52, 60, 0, 723, 779, 23, 50, 0, -372, 255, 113, -246, 265, 839, 378, 39, 0, 10635, 63, -7, -3, 506, 0, 2586, 49, 104, 529, 96, -171, -364, 0, 0, 0, 1291, -244, 0, -76, -103, 243, 424, 306, 24, 179, 0, 989, 249, 790, 154, 6530, 100, 59, 1205, 12223, 5935, 25, 282, 23, 1937, 384, 582, 91, 0, 1, 206, 164, 690, 2343, 137, 173, 45, 1270, 16, 486, 50, 152, 290, 54, -37, 101, 383, 81, 0, 229, -674, 90, 128, 179, 0, 54, 151, 61, 30, 523, 31, 79, -34, 448, 81, 144, 351, -67, 262, 0, 56, 26, 3, 41, 7, 105, 818, -16, 0, 2476, 1185, 217, 1685, 802, 0, 94, 0, 0, 517, 265, 947, 3, 42, 37, 57, 22, 8, 293, 3, 348, -19, 0, -4, 18, 139, 0, 1883, 216, 782, 904, 1705, 47, 176, 1225, 86, 82, 271, 1378, 184, 0, 0, 1357, 19, 434, 92, 1151, 41, 51, 214, 1161, 37, 787, 59, 253, 211, 235, 4384, 4080, 53, 0, 2127, 377, 73, 445, 243, 307, 155, 173, 400, 1428, 219, 7, 575, 298, 0, 5699, 176, 517, 257, 56, -390, 330, 195, 301, -41, 483, 28, 13, 965, 378, 219, 324, -69, 0, 205, 278, 1065, 34, 1033, 1467, -12, 388, 294, 1827, 627, 25, 315, 0, 66, -9, 349, 100, 0, 434, 3237, 275, 0, 207, 483, 2248)

bank_df <- data.frame(balance, duration)
glimpse(bank_df)
## Observations: 250
## Variables: 2
## $ balance  <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390...
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 13...
# Select both variables
bank_vars <- bank_df %>%
    select(balance, duration)
    
# Perform a Yeo-Johnson transformation 
processed_vars <- preProcess(bank_vars, method = c("YeoJohnson"))

# Use predict to transform data
bank_df <- predict(processed_vars, bank_df)

# Plot transformed features
ggplot(bank_df, aes(x=balance)) + 
    geom_density()

ggplot(bank_df, aes(x=duration)) + 
    geom_density()

# Create a scaled new feature scaled_hp
poke_df <- poke_df %>% 
    mutate(scaled_hp = (hp - min(hp)) / (max(hp) - min(hp)))

# Summarize both features
poke_df %>% 
    select(hp, scaled_hp) %>% 
    summary()
##        hp        scaled_hp    
##  Min.   :  1   Min.   :0.000  
##  1st Qu.: 50   1st Qu.:0.193  
##  Median : 68   Median :0.264  
##  Mean   : 70   Mean   :0.272  
##  3rd Qu.: 80   3rd Qu.:0.311  
##  Max.   :255   Max.   :1.000
# Use mutate to create column attack_mc
poke_df <- poke_df %>% 
    mutate(attack_mc = attack - mean(attack))


# Select variables 
poke_vars <- poke_df %>% 
    select(attack, spatk, spdef)
    
# Use preProcess to mean center variables
processed_vars <- preProcess(poke_vars, method=c("center"))

# Use predict to include transformed variables in the data
poke_df <- predict(processed_vars, poke_df)

# Summarize the three new column scales
poke_df %>% 
    select(attack, spatk, spdef) %>% 
    summary()
##      attack          spatk           spdef      
##  Min.   :-74.6   Min.   :-64.6   Min.   :-52.9  
##  1st Qu.:-24.6   1st Qu.:-24.6   1st Qu.:-22.9  
##  Median : -4.6   Median : -4.6   Median : -2.9  
##  Mean   :  0.0   Mean   :  0.0   Mean   :  0.0  
##  3rd Qu.: 20.4   3rd Qu.: 20.4   3rd Qu.: 17.1  
##  Max.   :110.4   Max.   :119.4   Max.   :157.1
# Standardize Speed
poke_df <- poke_df %>% 
    mutate(z_speed = (speed - mean(speed)) / sd(speed))

# Summarize new and original variable
poke_df %>% 
    select(speed, z_speed) %>% 
    summary()
##      speed         z_speed     
##  Min.   : 3.0   Min.   :-2.77  
##  1st Qu.:20.0   1st Qu.:-0.72  
##  Median :26.0   Median :-0.01  
##  Mean   :26.1   Mean   : 0.00  
##  3rd Qu.:32.7   3rd Qu.: 0.79  
##  Max.   :52.7   Max.   : 3.20
# Select variables 
poke_vars <- poke_df %>% 
    select(attack, defense, spatk, spdef)

# Create preProcess variable list 
processed_vars <- preProcess(poke_vars, method = c("center", "scale"))

# Use predict to assign standardized variables
poke_df <- predict(processed_vars, poke_df)

# Summarize new variables
poke_df %>% 
    select(attack, defense, spatk, spdef) %>% 
    summary()
##      attack         defense          spatk           spdef      
##  Min.   :-2.38   Min.   :-4.39   Min.   :-2.02   Min.   :-1.89  
##  1st Qu.:-0.78   1st Qu.:-0.76   1st Qu.:-0.77   1st Qu.:-0.82  
##  Median :-0.15   Median : 0.02   Median :-0.14   Median :-0.10  
##  Mean   : 0.00   Mean   : 0.00   Mean   : 0.00   Mean   : 0.00  
##  3rd Qu.: 0.65   3rd Qu.: 0.68   3rd Qu.: 0.64   3rd Qu.: 0.61  
##  Max.   : 3.52   Max.   : 3.47   Max.   : 3.74   Max.   : 5.61

Chapter 4 - Advanced Methods

Feature Crossing:

  • Feature crossing is combining 2+ features into a single new feature
  • Can be useful to visualize features prior to feature crossing
  • Example for creating feature crossing
    • dmy <- dummyVars(~ gender:infraction, data = myDF)
    • newDF <- predict(dmy, myDF)
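
The dummyVars() pattern above can be run end-to-end on a small made-up data frame (the gender and infraction columns here are hypothetical illustration data, not from the course exercises):

```r
# Sketch of feature crossing with caret::dummyVars, assuming two
# categorical columns; gender and infraction are illustration data
library(caret)

myDF <- data.frame(
    gender = factor(c("Male", "Female", "Male", "Female")),
    infraction = factor(c("Speeding", "Parking", "Parking", "Speeding"))
)

# gender:infraction creates one indicator column per combination of levels
dmy <- dummyVars(~ gender:infraction, data = myDF)
newDF <- predict(dmy, newdata = myDF)
dim(newDF)  # 4 rows, 2 x 2 = 4 crossed indicator columns
```

Each row of newDF has a 1 in exactly one of the four crossed columns, so the model can learn a separate effect for each gender/infraction combination.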

Principal Component Analysis:

  • PCA is a linear combination of the original features, with each of the axes in the new space being orthogonal
    • PCA is typically used for dimensionality reduction
    • prcomp(myData, center=TRUE, scale=TRUE) # run the pre-processing steps to create N(0, 1) on each underlying feature

Interpreting PCA Output:

  • Can inspect and interpret the PCA outputs
    • summary(myPCA) # standard deviation, proportion of variance, and cumulative proportion by component
    • myPCA$sdev^2 # variance of each component
    • autoplot(myPCA, data=myData, colour="myColorVar")
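The inspection steps above can be sketched with a minimal example; this uses the built-in mtcars data rather than any course dataset, and the column selection is purely illustrative:

```r
# Run PCA on a few numeric columns of the built-in mtcars data
my_pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], center = TRUE, scale = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion by component
summary(my_pca)

# Variance of each component, and the proportion of total variance explained
my_pca$sdev^2
pve <- my_pca$sdev^2 / sum(my_pca$sdev^2)
pve
```

The proportions in pve always sum to 1, which is a quick sanity check on the decomposition.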

Wrap Up:

  • Developing meaningful features from categorical data
  • Bucketing and binning for numerical variables, and date-times
  • Transformations such as Box-Cox or Yeo-Johnson and z-score
  • Principal Component Analysis (PCA)

Example code includes:

# Copy over first 250 records
gender <- stringr::str_trim(c(c(' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Female', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Female', ' Female', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Female', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Female', ' Female', ' Female', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Female', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Female', ' Male', ' Female', ' Male', ' Female', ' Male', ' Female', ' Female', ' Female', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Female', ' Female', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Female', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Female', ' 
Female', ' Female'), c(' Male', ' Male', ' Female', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Female', ' Female', ' Male', ' Female', ' Female', ' Male', ' Female', ' Male', ' Female', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male', ' Male', ' Male', ' Male', ' Female', ' Male', ' Male')))
occ1 <- c(' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners', ' Handlers-cleaners', ' Prof-specialty', ' Exec-managerial', ' Other-service', ' Exec-managerial', ' Prof-specialty', ' Exec-managerial', ' Exec-managerial', ' Prof-specialty', ' Adm-clerical', ' Sales', ' Craft-repair', ' Transport-moving', ' Farming-fishing', ' Machine-op-inspct', ' Sales', ' Exec-managerial', ' Prof-specialty', ' Other-service', ' Farming-fishing', ' Transport-moving', ' Tech-support', ' Tech-support', ' Craft-repair', 'unknown', ' Exec-managerial', ' Craft-repair', ' Protective-serv', ' Sales', ' Exec-managerial', ' Adm-clerical', ' Other-service', ' Machine-op-inspct', ' Machine-op-inspct', ' Adm-clerical', ' Sales', ' Prof-specialty', ' Machine-op-inspct', ' Prof-specialty', ' Tech-support', ' Adm-clerical', ' Handlers-cleaners', ' Prof-specialty', ' Machine-op-inspct', ' Exec-managerial', ' Craft-repair', ' Prof-specialty', ' Exec-managerial', ' Other-service', ' Prof-specialty', ' Exec-managerial', ' Exec-managerial', ' Tech-support', ' Machine-op-inspct', ' Other-service', ' Adm-clerical', ' Machine-op-inspct', ' Sales', 'unknown', ' Transport-moving', ' Prof-specialty', ' Tech-support', ' Craft-repair', ' Adm-clerical', ' Adm-clerical', ' Exec-managerial', 'unknown', ' Prof-specialty', ' Sales', ' Sales', ' Machine-op-inspct', ' Prof-specialty', ' Other-service', ' Adm-clerical', 'unknown', ' Other-service', ' Farming-fishing', ' Sales', ' Other-service', ' Other-service', ' Sales', ' Craft-repair', ' Sales', ' Protective-serv', ' Prof-specialty', ' Sales', ' Prof-specialty', ' Prof-specialty', ' Craft-repair', ' Machine-op-inspct', ' Sales', ' Protective-serv', ' Handlers-cleaners', ' Prof-specialty', ' Sales', ' Exec-managerial', ' Other-service', ' Exec-managerial')
occ2 <- c(' Exec-managerial', ' Prof-specialty', ' Tech-support', ' Craft-repair', ' Craft-repair', 'unknown', ' Handlers-cleaners', ' Adm-clerical', ' Handlers-cleaners', ' Sales', ' Prof-specialty', ' Other-service', ' Sales', ' Machine-op-inspct', ' Handlers-cleaners', ' Sales', ' Craft-repair', ' Sales', ' Craft-repair', ' Other-service', ' Exec-managerial', ' Exec-managerial', ' Prof-specialty', ' Other-service', ' Exec-managerial', ' Adm-clerical', ' Adm-clerical', 'unknown', ' Craft-repair', ' Sales', ' Other-service', ' Craft-repair', ' Sales', ' Tech-support', ' Prof-specialty', ' Craft-repair', ' Adm-clerical', ' Sales', ' Craft-repair', ' Craft-repair', ' Sales', ' Other-service', ' Prof-specialty', ' Tech-support', ' Transport-moving', ' Other-service', ' Other-service', ' Craft-repair', 'unknown', ' Adm-clerical', ' Adm-clerical', ' Exec-managerial', ' Craft-repair', 'unknown', ' Craft-repair', ' Handlers-cleaners', ' Sales', ' Craft-repair', ' Other-service', 'unknown', ' Other-service', ' Exec-managerial', ' Exec-managerial', ' Sales', ' Other-service', ' Exec-managerial', ' Protective-serv', ' Handlers-cleaners', ' Prof-specialty', ' Other-service', ' Protective-serv', ' Sales', ' Craft-repair', ' Prof-specialty', ' Sales', ' Craft-repair', ' Handlers-cleaners', ' Other-service', ' Prof-specialty', ' Exec-managerial', ' Adm-clerical', ' Craft-repair', ' Machine-op-inspct', ' Adm-clerical', ' Adm-clerical', ' Exec-managerial')
occ3 <- c( 'unknown', ' Prof-specialty', ' Prof-specialty', ' Machine-op-inspct', ' Machine-op-inspct', ' Craft-repair', ' Tech-support', ' Tech-support', ' Transport-moving', ' Craft-repair', ' Exec-managerial', ' Prof-specialty', ' Sales', ' Prof-specialty', 'unknown', ' Exec-managerial', ' Prof-specialty', ' Adm-clerical', ' Sales', ' Other-service', ' Craft-repair', ' Sales', ' Sales', ' Transport-moving', ' Craft-repair', ' Sales', ' Craft-repair', ' Machine-op-inspct', ' Exec-managerial', ' Sales', ' Sales', ' Prof-specialty', ' Craft-repair', ' Handlers-cleaners', 'unknown', ' Other-service', ' Adm-clerical', ' Machine-op-inspct', ' Sales', 'unknown', ' Farming-fishing', ' Adm-clerical', ' Adm-clerical', ' Transport-moving', ' Sales', ' Adm-clerical', ' Craft-repair', ' Prof-specialty', ' Other-service', ' Adm-clerical', ' Exec-managerial', ' Exec-managerial', ' Transport-moving', ' Prof-specialty', ' Other-service', ' Protective-serv', 'unknown', ' Craft-repair', ' Adm-clerical', ' Adm-clerical', ' Other-service', ' Tech-support', ' Adm-clerical')
occupation <- stringr::str_trim(c(occ1, occ2, occ3))

adult_incomes <- data.frame(gender, occupation, stringsAsFactors = FALSE)
glimpse(adult_incomes)
## Observations: 250
## Variables: 2
## $ gender     <chr> "Male", "Male", "Male", "Male", "Female", "Female",...
## $ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleane...
# Group the data and create a summary of the counts
adult_incomes %>%
    count(occupation, gender) %>%
    # Create a grouped bar graph
    ggplot(., aes(x=occupation, y=n, fill=gender)) +
    geom_bar(stat="identity", position="dodge") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Create a table of the variables of interest
adult_incomes %>% 
    select(gender, occupation) %>% 
    table()
##         occupation
## gender   Adm-clerical Craft-repair Exec-managerial Farming-fishing
##   Female           15            3              11               0
##   Male             12           28              19               4
##         occupation
## gender   Handlers-cleaners Machine-op-inspct Other-service Prof-specialty
##   Female                 0                 4            12             10
##   Male                  11                11            15             22
##         occupation
## gender   Protective-serv Sales Tech-support Transport-moving unknown
##   Female               0    14            3                1       5
##   Male                 6    20            8                7       9
# Create a feature cross between gender and occupation
dmy <- dummyVars(~ gender:occupation, data=adult_incomes)

# Create object of your resulting data frame
oh_data <- predict(dmy, adult_incomes)

# Summarize the resulting output
summary(oh_data)
##  genderFemale:occupationAdm-clerical genderMale:occupationAdm-clerical
##  Min.   :0.00                        Min.   :0.000                    
##  1st Qu.:0.00                        1st Qu.:0.000                    
##  Median :0.00                        Median :0.000                    
##  Mean   :0.06                        Mean   :0.048                    
##  3rd Qu.:0.00                        3rd Qu.:0.000                    
##  Max.   :1.00                        Max.   :1.000                    
##  genderFemale:occupationCraft-repair genderMale:occupationCraft-repair
##  Min.   :0.000                       Min.   :0.000                    
##  1st Qu.:0.000                       1st Qu.:0.000                    
##  Median :0.000                       Median :0.000                    
##  Mean   :0.012                       Mean   :0.112                    
##  3rd Qu.:0.000                       3rd Qu.:0.000                    
##  Max.   :1.000                       Max.   :1.000                    
##  genderFemale:occupationExec-managerial
##  Min.   :0.000                         
##  1st Qu.:0.000                         
##  Median :0.000                         
##  Mean   :0.044                         
##  3rd Qu.:0.000                         
##  Max.   :1.000                         
##  genderMale:occupationExec-managerial
##  Min.   :0.000                       
##  1st Qu.:0.000                       
##  Median :0.000                       
##  Mean   :0.076                       
##  3rd Qu.:0.000                       
##  Max.   :1.000                       
##  genderFemale:occupationFarming-fishing
##  Min.   :0                             
##  1st Qu.:0                             
##  Median :0                             
##  Mean   :0                             
##  3rd Qu.:0                             
##  Max.   :0                             
##  genderMale:occupationFarming-fishing
##  Min.   :0.000                       
##  1st Qu.:0.000                       
##  Median :0.000                       
##  Mean   :0.016                       
##  3rd Qu.:0.000                       
##  Max.   :1.000                       
##  genderFemale:occupationHandlers-cleaners
##  Min.   :0                               
##  1st Qu.:0                               
##  Median :0                               
##  Mean   :0                               
##  3rd Qu.:0                               
##  Max.   :0                               
##  genderMale:occupationHandlers-cleaners
##  Min.   :0.000                         
##  1st Qu.:0.000                         
##  Median :0.000                         
##  Mean   :0.044                         
##  3rd Qu.:0.000                         
##  Max.   :1.000                         
##  genderFemale:occupationMachine-op-inspct
##  Min.   :0.000                           
##  1st Qu.:0.000                           
##  Median :0.000                           
##  Mean   :0.016                           
##  3rd Qu.:0.000                           
##  Max.   :1.000                           
##  genderMale:occupationMachine-op-inspct
##  Min.   :0.000                         
##  1st Qu.:0.000                         
##  Median :0.000                         
##  Mean   :0.044                         
##  3rd Qu.:0.000                         
##  Max.   :1.000                         
##  genderFemale:occupationOther-service genderMale:occupationOther-service
##  Min.   :0.000                        Min.   :0.00                      
##  1st Qu.:0.000                        1st Qu.:0.00                      
##  Median :0.000                        Median :0.00                      
##  Mean   :0.048                        Mean   :0.06                      
##  3rd Qu.:0.000                        3rd Qu.:0.00                      
##  Max.   :1.000                        Max.   :1.00                      
##  genderFemale:occupationProf-specialty genderMale:occupationProf-specialty
##  Min.   :0.00                          Min.   :0.000                      
##  1st Qu.:0.00                          1st Qu.:0.000                      
##  Median :0.00                          Median :0.000                      
##  Mean   :0.04                          Mean   :0.088                      
##  3rd Qu.:0.00                          3rd Qu.:0.000                      
##  Max.   :1.00                          Max.   :1.000                      
##  genderFemale:occupationProtective-serv
##  Min.   :0                             
##  1st Qu.:0                             
##  Median :0                             
##  Mean   :0                             
##  3rd Qu.:0                             
##  Max.   :0                             
##  genderMale:occupationProtective-serv genderFemale:occupationSales
##  Min.   :0.000                        Min.   :0.000               
##  1st Qu.:0.000                        1st Qu.:0.000               
##  Median :0.000                        Median :0.000               
##  Mean   :0.024                        Mean   :0.056               
##  3rd Qu.:0.000                        3rd Qu.:0.000               
##  Max.   :1.000                        Max.   :1.000               
##  genderMale:occupationSales genderFemale:occupationTech-support
##  Min.   :0.00               Min.   :0.000                      
##  1st Qu.:0.00               1st Qu.:0.000                      
##  Median :0.00               Median :0.000                      
##  Mean   :0.08               Mean   :0.012                      
##  3rd Qu.:0.00               3rd Qu.:0.000                      
##  Max.   :1.00               Max.   :1.000                      
##  genderMale:occupationTech-support genderFemale:occupationTransport-moving
##  Min.   :0.000                     Min.   :0.000                          
##  1st Qu.:0.000                     1st Qu.:0.000                          
##  Median :0.000                     Median :0.000                          
##  Mean   :0.032                     Mean   :0.004                          
##  3rd Qu.:0.000                     3rd Qu.:0.000                          
##  Max.   :1.000                     Max.   :1.000                          
##  genderMale:occupationTransport-moving genderFemale:occupationunknown
##  Min.   :0.000                         Min.   :0.00                  
##  1st Qu.:0.000                         1st Qu.:0.00                  
##  Median :0.000                         Median :0.00                  
##  Mean   :0.028                         Mean   :0.02                  
##  3rd Qu.:0.000                         3rd Qu.:0.00                  
##  Max.   :1.000                         Max.   :1.00                  
##  genderMale:occupationunknown
##  Min.   :0.000               
##  1st Qu.:0.000               
##  Median :0.000               
##  Mean   :0.036               
##  3rd Qu.:0.000               
##  Max.   :1.000
# Create the df
poke_x <- poke_df %>% 
    select(hp, attack, defense, spatk, spdef, speed)

# Perform PCA 
poke_pca <- prcomp(poke_x, center=TRUE, scale=TRUE)


# Calculate the proportion of variance
prop_var <- data.frame(sdev=poke_pca$sdev)
prop_var <- prop_var %>%
    mutate(pca_comp = 1:n(), pcVar = sdev^2, propVar_ex = pcVar/sum(pcVar))


# Create a plot of the components and proportion of variance
ggplot(prop_var, aes(pca_comp, propVar_ex, group=1)) + 
    geom_line() +
    geom_point()

# Create a plot of the first two components
library(ggfortify)
autoplot(poke_pca, data = poke_df)


Introduction to Text Analysis in R

Chapter 1 - Wrangling Text

Text as Data:

  • The tidyverse tools work well together and can be leveraged for text analysis
    • library(tidyverse)
    • review_data <- read_csv("myFile.csv")
  • Can use group_by() and summarize() to get key metrics by category
    • review_data %>% group_by(product) %>% summarize(mean(stars, na.rm=TRUE))

Counting Categorical Data:

  • The standard print() command for the tibble shows each column and its data type
  • Can get count data using n() or count()
    • review_data %>% group_by(myGroup) %>% summarize(n = n())
    • review_data %>% count(myGroup) %>% arrange(desc(n)) # default name for the column created is n
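To see that the two counting approaches agree, here is a small sketch using the built-in mtcars data (not the course's review_data):

```r
library(dplyr)

# count(cyl) is shorthand for group_by(cyl) %>% summarize(n = n())
by_summarize <- mtcars %>% group_by(cyl) %>% summarize(n = n())
by_count <- mtcars %>% count(cyl)

# Sort the counts in descending order; the default count column is named n
by_count %>% arrange(desc(n))
```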

Tokenizing and Cleaning:

  • The tidytext package provides a suite of powerful tools for structuring and analyzing text
  • Some of the NLP (natural language processing) vocabulary includes:
    • Bag of words - assumes each word is independent
    • Each separate body of text is a document
    • Each unique word is a term
    • Every occurrence of a term is known as a token
  • Can use unnest_tokens() to create tokens
    • tidy_review <- review_data %>% unnest_tokens(word, review) # create column word by tokenizing column review; also cleans punctuation, casing, white-space, etc.
    • tidy_review %>% count(word) %>% arrange(desc(n))
  • Stop words are common and non-informative words - can use anti_join to eliminate these
    • tidy_review %>% anti_join(stop_words) %>% count(word) %>% arrange(desc(n))

Example code includes:

twitter_data <- readRDS("./RInputFiles/ch_1_twitter_data.rds")
glimpse(twitter_data)


# Print twitter_data
twitter_data

# Print just the complaints in twitter_data
twitter_data %>% 
    filter(complaint_label == "Complaint")


# Start with the data frame
twitter_data %>% 
    # Group by whether or not the tweet is a complaint
    group_by(complaint_label) %>% 
    # Compute the mean, min, and max follower counts
    summarize(avg_followers = mean(usr_followers_count),
              min_followers = min(usr_followers_count),
              max_followers = max(usr_followers_count)
              )


twitter_data %>% 
    # Filter for just the complaints
    filter(complaint_label == "Complaint") %>% 
    # Count the number of verified and non-verified users
    count(usr_verified)


twitter_data %>% 
    # Group by whether or not a user is verified
    group_by(usr_verified) %>% 
    summarize(
        # Compute the average number of followers
        avg_followers = mean(usr_followers_count),
        # Count the number of users in each category
        n = n()
        )


tidy_twitter <- twitter_data %>% 
    # Tokenize the twitter data
    tidytext::unnest_tokens(word, tweet_text) 

tidy_twitter %>% 
    # Compute word counts
    count(word) %>% 
    # Arrange the counts in descending order
    arrange(desc(n))


tidy_twitter <- twitter_data %>% 
    # Tokenize the twitter data
    tidytext::unnest_tokens(word, tweet_text) %>% 
    # Remove stop words
    anti_join(tidytext::stop_words)

tidy_twitter %>% 
    # Filter to keep complaints only
    filter(complaint_label == "Complaint") %>% 
    # Compute word counts and arrange in descending order
    count(word) %>% 
    arrange(desc(n))

Chapter 2 - Visualizing Text

Plotting Word Counts:

  • Can get the row numbers using row_number()
    • myData %>% mutate(id=row_number()) %>% ...
  • Can generate a bar plot to look at the results
    • word_counts <- tidy_review %>% count(word) %>% arrange(desc(n))
    • ggplot(word_counts, aes(x=word, y=n)) + geom_col() # many problems with this chart
  • Should run filters to get only the key words of interest
    • Cutoffs for number of usages
    • Flipping coordinates with coord_flip()
    • Adding a title with ggtitle("myTitle")

Improving Word Count Plots:

  • Often have additional custom stop words that should be added to the defaults
  • Can use tribble() to set up a tibble
    • custom_stop <- tribble(~colName1, ~colName2, "col1Row1Data", "col2Row1Data", "col1Row2Data", "col2Row2Data", ...)
    • stop_words2 <- stop_words %>% bind_rows(custom_stop)
  • The arrange function does not impact the plots - need to reorder the levels of the factors
    • mutate(word2 = fct_reorder(word, n)) # reorder column word by n

Faceting Word Count Plots:

  • Can count words by product to see where the primary words are being used
    • tidy_review %>% count(word, product) %>% arrange(desc(n)) %>% top_n(10, n) %>% ungroup() # keep the top 10 items based on n
  • Can use fct_reorder() as per previous and then plot using facet_wrap()
    • ggplot(word_counts, aes(x=word2, y=n, fill=product)) + geom_col(show.legend=FALSE) + facet_wrap(~ product, scales = "free_y") + coord_flip() + ggtitle("myTitle")

Plotting Word Clouds:

  • The wordcloud package can be used to make word clouds
    • library(wordcloud)
    • word_counts <- tidy_review %>% count(word)
    • wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 30, colors = "blue")
  • Location of words in a cloud is random, while size of the word is based on word frequency
  • Need to be careful not to include too many words in the max.words (cloud may run off the screen)

Example code includes:

word_counts <- tidy_twitter %>% 
    filter(complaint_label == "Complaint") %>% 
    count(word) %>% 
    # Keep words with count greater than 100
    filter(n > 100)

# Create a bar plot using word_counts
ggplot(word_counts, aes(x=word, y=n)) +
    geom_col() +
    # Flip the plot coordinates
    coord_flip()


word_counts <- tidy_twitter %>% 
    # Only keep the non-complaints
    filter(complaint_label == "Non-Complaint") %>% 
    count(word) %>% 
    filter(n > 150)

# Create a bar plot using the new word_counts
ggplot(word_counts, aes(x=word, y=n)) +
    geom_col() +
    coord_flip() +
    # Title the plot "Non-Complaint Word Counts"
    ggtitle("Non-Complaint Word Counts")


custom_stop_words <- tribble(
    # Column names should match stop_words
    ~word, ~lexicon,
    # Add http, win, and t.co as custom stop words
    "http", "CUSTOM",
    "win", "CUSTOM",
    "t.co", "CUSTOM"
)

# Bind the custom stop words to stop_words
stop_words2 <- tidytext::stop_words %>% 
    bind_rows(custom_stop_words)


word_counts <- tidy_twitter %>% 
    filter(complaint_label == "Non-Complaint") %>% 
    count(word) %>% 
    # Keep terms that occur more than 100 times
    filter(n > 100) %>% 
    # Reorder word as an ordered factor by word counts
    mutate(word2 = fct_reorder(word, n))

# Plot the new word column with type factor
ggplot(word_counts, aes(x=word2, y=n)) +
    geom_col() +
    coord_flip() +
    ggtitle("Non-Complaint Word Counts")


word_counts <- tidy_twitter %>%
    # Count words by whether or not it's a complaint
    count(word, complaint_label) %>%
    # Group by whether or not it's a complaint
    group_by(complaint_label) %>%
    # Keep the top 20 words
    top_n(20, n) %>%
    # Ungroup before reordering word as a factor by the count
    ungroup() %>%
    mutate(word2 = fct_reorder(word, n))


# Include a color aesthetic tied to whether or not it's a complaint
ggplot(word_counts, aes(x = word2, y = n, fill = complaint_label)) +
    # Don't include the legend for the column plot
    geom_col(show.legend = FALSE) +
    # Facet by whether or not it's a complaint and make the y-axis free
    facet_wrap(~ complaint_label, scales = "free_y") +
    # Flip the coordinates and add a title: "Twitter Word Counts"
    coord_flip() +
    ggtitle("Twitter Word Counts")


# Compute word counts and assign to word_counts
word_counts <- tidy_twitter %>% 
    count(word)

wordcloud::wordcloud(
    # Assign the word column to words
    words = word_counts$word, 
    # Assign the count column to freq
    freq = word_counts$n,
    max.words = 30
)


# Compute complaint word counts and assign to word_counts
word_counts <- tidy_twitter %>% 
    filter(complaint_label=="Complaint") %>% 
    count(word)

# Create a complaint word cloud of the top 50 terms, colored red
wordcloud::wordcloud(
    words = word_counts$word, 
    freq = word_counts$n, 
    max.words = 50, 
    colors = "red"
)

Chapter 3 - Sentiment Analysis

Sentiment Dictionaries:

  • Can use the get_sentiments() function to access sentiment dictionaries
    • get_sentiments("bing") %>% count(sentiment)
    • get_sentiments("afinn") %>% summarize(min(value), max(value))
    • get_sentiments("loughran") %>% count(sentiment) %>% mutate(sentiment2 = fct_reorder(sentiment, n)) # multiple types of sentiments, and words can have more than one

Appending Dictionaries:

  • Can append sentiment dictionaries using inner_join()
    • tidy_review %>% inner_join(get_sentiments("loughran"))
  • May want to visualize the most common words by sentiment
    • sentiment_review2 <- sentiment_review %>% filter(sentiment %in% c("positive", "negative"))
    • word_counts <- sentiment_review2 %>% count(word, sentiment) %>% group_by(sentiment) %>% top_n(10, n) %>% ungroup() %>% mutate(word2 = fct_reorder(word, n))
    • ggplot(word_counts, aes(x=word2, y=n, fill=sentiment)) + geom_col(show.legend=FALSE) + facet_wrap(~ sentiment, scales = "free") + coord_flip() + labs(title = "myTitle", x = "Words")

Improving Sentiment Analysis:

  • Can use the tidyr::spread() function to convert tidy data to wide and short
    • tidy_review %>% inner_join(get_sentiments("bing")) %>% count(stars, sentiment) %>% spread(sentiment, n)

Example code includes:

# Count the number of words associated with each sentiment in bing
tidytext::get_sentiments("bing") %>% 
    count(sentiment) %>% 
    # Arrange the counts in descending order
    arrange(-n)


# Pull in the bing dictionary, count the sentiments and reorder them by count
sentiment_counts <- tidytext::get_sentiments("bing") %>% 
    count(sentiment) %>% 
    mutate(sentiment2 = fct_reorder(sentiment, n))

# Visualize sentiment_counts using the new sentiment factor column
ggplot(sentiment_counts, aes(x=sentiment2, y=n)) +
    geom_col() +
    coord_flip() +
    # Change the title to "Sentiment Counts in Bing", x-axis to "Counts", and y-axis to "Sentiment"
    labs(title = "Sentiment Counts in Bing", x = "Counts", y = "Sentiment")


# Join tidy_twitter and the bing sentiment dictionary
sentiment_twitter <- tidy_twitter %>% 
    inner_join(tidytext::get_sentiments("bing"))

# Count the sentiments in tidy_twitter
sentiment_twitter %>% 
    count(sentiment) %>% 
    # Arrange the sentiment counts in descending order
    arrange(-n)


word_counts <- tidy_twitter %>% 
    # Append the bing dictionary and filter for positive, fear, and trust (bing only contains positive and negative)
    inner_join(tidytext::get_sentiments("bing")) %>% 
    filter(sentiment %in% c("positive", "fear", "trust")) %>%
    # Count by word and sentiment and keep the top 10 of each
    count(word, sentiment) %>% 
    group_by(sentiment) %>% 
    top_n(10, n) %>% 
    ungroup() %>% 
    # Create a factor called word2 that has each word ordered by the count
    mutate(word2 = fct_reorder(word, n))

# Create a bar plot out of the word counts colored by sentiment
ggplot(word_counts, aes(x=word2, y=n, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    # Create a separate facet for each sentiment with free axes
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip() +
    # Title the plot "Sentiment Word Counts" with "Words" for the x-axis
    labs(title = "Sentiment Word Counts", x = "Words")


tidy_twitter %>% 
    # Append the bing sentiment dictionary
    inner_join(tidytext::get_sentiments("bing")) %>% 
    # Count by complaint label and sentiment
    count(complaint_label, sentiment) %>% 
    # Spread the sentiment and count columns
    spread(sentiment, n)


tidy_twitter %>% 
    # Append the afinn sentiment dictionary
    inner_join(tidytext::get_sentiments("afinn")) %>% 
    # Group by both complaint label and whether or not the user is verified
    group_by(complaint_label, usr_verified) %>% 
    # Summarize the data with an aggregate_score = sum(value)
    summarize(aggregate_score = sum(value)) %>% 
    # Spread the complaint_label and aggregate_score columns
    spread(complaint_label, aggregate_score) %>% 
    mutate(overall_sentiment = Complaint + `Non-Complaint`)


sentiment_twitter <- tidy_twitter %>% 
    # Append the bing sentiment dictionary
    inner_join(tidytext::get_sentiments("bing")) %>% 
    # Count by complaint label and sentiment
    count(complaint_label, sentiment) %>% 
    # Spread the sentiment and count columns
    spread(sentiment, n) %>% 
    # Compute overall_sentiment = positive - negative
    mutate(overall_sentiment = positive - negative)

# Create a bar plot out of overall sentiment by complaint level, colored by a complaint label factor
ggplot(sentiment_twitter, aes(x=complaint_label, y=overall_sentiment, fill=as.factor(complaint_label))) +
    geom_col(show.legend = FALSE) +
    coord_flip() + 
    # Title the plot "Overall Sentiment by Complaint Type," with an "Airline Twitter Data" subtitle
    labs(title = "Overall Sentiment by Complaint Type", subtitle = "Airline Twitter Data")

Chapter 4 - Topic Modeling

Latent Dirichlet Allocation:

  • Can assess the underlying topics of a document using LDA (Latent Dirichlet Allocation)
    • Collections of documents are known as a corpus, each with its own bag of words
    • LDA finds patterns of words that tend to appear together (unsupervised learning)
  • Comparisons between clustering and topic modeling
    • Clusters are based on distances, which are continuous (every item is a full member of a single segment)
    • Topic modeling is based on word counts, which are discrete (every document is a partial member of every topic)

Document Term Matrices:

  • The DTM (document-term matrix) is the building block for the analysis
    • The DTM format is especially useful for sparse matrices (matrices where 0 is the most common element)
    • tidy_review %>% count(word, id) %>% cast_dtm(id, word, n) # document, word, count
  • Can use as.matrix() to view a sparse matrix - generally should subset for easier viewing
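A minimal sketch of casting a tidy data frame to a DTM and viewing a subset; the two-document text_df below is made up for illustration, and tidytext must be installed:

```r
library(dplyr)
library(tidytext)

# A tiny made-up corpus: two documents, three distinct words
text_df <- tibble::tibble(
    id   = c(1, 1, 2, 2, 2),
    word = c("great", "phone", "great", "battery", "battery")
)

# Cast word counts to a DTM: cast_dtm(document, term, count)
dtm <- text_df %>%
    count(id, word) %>%
    cast_dtm(id, word, n)

# DTMs print as sparse-matrix summaries; subset a dense view to inspect
as.matrix(dtm)[1:2, 1:3]
```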

Running Topic Models:

  • Can use the topicmodels::LDA() to run topic modeling
    • library(topicmodels)
    • lda_out <- LDA(dtm_review, k=2, method="Gibbs", control=list(seed=42)) # k=2 will create 2 topics (hyper-parameter for tuning)
    • glimpse(lda_out)
    • lda_out %>% tidy(matrix="beta") %>% arrange(-beta)

Interpreting Topics:

  • Key is to find distinct but non-repetitive topics
  • It takes user expertise to examine the top words by topic and assign a label
    • Adding a topic can dramatically change the solution
    • Can change the number of topics until the segments appear distinct and non-repetitive
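The "change k and inspect" workflow above can be sketched as follows; this assumes a dtm_review built as in the previous section (a hypothetical object here), with topicmodels and broom installed:

```r
library(dplyr)
library(topicmodels)

# Fit LDA for several values of k and inspect the top terms of each solution
for (k in 2:4) {
    lda_k <- LDA(dtm_review, k = k, method = "Gibbs", control = list(seed = 42))
    top_terms <- broom::tidy(lda_k, matrix = "beta") %>%
        group_by(topic) %>%
        top_n(5, beta) %>%
        ungroup() %>%
        arrange(topic, -beta)
    print(top_terms)  # are the topics distinct and non-repetitive?
}
```

Fixing the seed keeps the Gibbs sampler reproducible, so differences between runs reflect only the change in k.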

Wrap Up:

  • Tokenizing text and removing stop words
  • Visualizing word counts
  • Conducting sentiment analysis
  • Running and interpreting topic models

Example code includes:

excl_words <- c("t.co", "http", "klm", "united", "americanair", "delta", "de", "southwestair", "usairways", 
                "jetblue", "british_airways", "amp", "deltaassist", "2", "ryanair", "4", "en", 
                "aircanada", "el", "emirates", "3", "virginamerica", "alaskaair", "1", "es", "vueling", 
                "britishairways", "se", "indonesiagaruda", "airfrancefr", "nedmex", "turkishairlines", 
                "airasia", "20", "flyfrontier", "tamairlines", "5", "30", "6", "10", "taylorcaniff",
                "dm", "frontiercare", "ik"
                )

# Cast the word counts by tweet into a DTM
dtm_twitter <- tidy_twitter %>% 
    filter(!(word %in% excl_words)) %>%
    count(word, tweet_id) %>% 
    tidytext::cast_dtm(tweet_id, word, n)


# Run an LDA with 2 topics and a Gibbs sampler
lda_out2 <- topicmodels::LDA(dtm_twitter, k = 2, method = "Gibbs", control = list(seed = 42))

# Tidy the matrix of word probabilities
lda_topics <- lda_out2 %>% 
    broom::tidy(matrix = "beta")


# Print the output from LDA run
lda_topics

# Start with the topics output from the LDA run
lda_topics %>% 
    # Arrange the topics by word probabilities in descending order
    arrange(-beta)


# Produce a grouped summary of the LDA output by topic
lda_topics %>% 
    group_by(topic) %>% 
    summarize(
        # Calculate the sum of the word probabilities
        sum = sum(beta),
        # Count the number of terms
        n = n()
        )


word_probs <- lda_topics %>%
    # Keep the top 10 highest word probabilities by topic
    group_by(topic) %>% 
    top_n(10, beta) %>% 
    ungroup() %>%
    # Create term2, a factor ordered by word probability
    mutate(term2 = fct_reorder(term, beta))

# Plot term2 and the word probabilities
ggplot(word_probs, aes(x=term2, y=beta)) +
    geom_col() +
    # Facet the bar plot by topic
    facet_wrap(~topic, scales = "free") +
    coord_flip()


# Start with the tidied Twitter data
tidy_twitter %>% 
    filter(!(word %in% excl_words)) %>%
    # Count each word used in each tweet
    count(word, tweet_id) %>% 
    # Use the word counts by tweet to create a DTM
    tidytext::cast_dtm(tweet_id, word, n)


# Assign the DTM to dtm_twitter
dtm_twitter <- tidy_twitter %>% 
    filter(!(word %in% excl_words)) %>%
    count(word, tweet_id) %>% 
    # Cast the word counts by tweet into a DTM
    tidytext::cast_dtm(tweet_id, word, n)

# Coerce dtm_twitter into a matrix called matrix_twitter
matrix_twitter <- as.matrix(dtm_twitter)

# Print rows 1 through 5 and columns 90 through 95
matrix_twitter[1:5, 90:95]


# Run an LDA with 2 topics and a Gibbs sampler
lda_out <- topicmodels::LDA(dtm_twitter, k = 2, method = "Gibbs", control = list(seed = 42))

# Glimpse the topic model output
glimpse(lda_out)

# Tidy the matrix of word probabilities
lda_topics <- lda_out %>% 
    broom::tidy(matrix="beta")

# Arrange the topics by word probabilities in descending order
lda_topics %>% 
    arrange(-beta)


# Run an LDA with 3 topics and a Gibbs sampler
lda_out2 <- topicmodels::LDA(dtm_twitter, k = 3, method = "Gibbs", control = list(seed = 42))

# Tidy the matrix of word probabilities
lda_topics2 <- lda_out2 %>% 
    broom::tidy(matrix = "beta")

# Arrange the topics by word probabilities in descending order
lda_topics2 %>% 
    arrange(-beta)


# Select the top 15 terms by topic and reorder term
word_probs2 <- lda_topics2 %>% 
    group_by(topic) %>% 
    top_n(15, beta) %>% 
    ungroup() %>%
    mutate(term2 = fct_reorder(term, beta))

# Plot word_probs2, color and facet based on topic
ggplot(word_probs2, aes(x=term2, y=beta, fill=as.factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()


# Run an LDA with 4 topics and a Gibbs sampler
lda_out3 <- topicmodels::LDA(dtm_twitter, k = 4, method = "Gibbs", control = list(seed = 42))

# Tidy the matrix of word probabilities
lda_topics3 <- lda_out3 %>% 
    broom::tidy(matrix = "beta")

# Arrange the topics by word probabilities in descending order
lda_topics3 %>% 
    arrange(-beta)


# Select the top 15 terms by topic and reorder term
word_probs3 <- lda_topics3 %>% 
    group_by(topic) %>% 
    top_n(15, beta) %>% 
    ungroup() %>%
    mutate(term2 = fct_reorder(term, beta))

# Plot word_probs3, color and facet based on topic
ggplot(word_probs3, aes(x=term2, y=beta, fill=as.factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()

Survey and Measure Development in R

Chapter 1 - Preparing to Analyze Survey Data

Surveys in Marketing Research:

  • Surveys can consist of items rated on the Likert scale (commonly, but not necessarily, 1-5 scales)
    • Step 1 - item generation (expert review, SME, etc.)
    • Step 2 - questionnaire administration
    • Step 3 - initial item reduction
    • Step 4 - confirmatory factor analysis
    • Step 5 - Convergent/Discriminant Validity
    • Step 6 - Replication
  • Example for inter-rater reliability
    • library(irr)
    • agree(experts)
    • cohen.kappa(experts) # 0-0.4 is poor, 0.4-0.6 is mediocre, 0.6-0.8 is substantial, 0.8+ is very strong
  • Example for content validity ratios
    • Lawshe’s Content Validity Ratio (CVR): what percent of experts judge that item essential to what’s being measured
    • CVR = [ E - (N/2) ] / (N/2) where N is the total number of experts and E is the number who rated the item as essential
    • psychometric::CVratio(NTOTAL=, NESSENTIAL=) # -1 is consensus against, +1 is consensus in favor
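
The CVR formula above can be checked with quick arithmetic on a hypothetical panel: with N = 5 experts of whom E = 4 rate an item essential, CVR = (4 - 5/2) / (5/2) = 0.6. A minimal sketch, assuming the psychometric package is installed:

```r
# Worked CVR example with hypothetical numbers: N = 5 experts, E = 4 rate the item essential
# CVR = (E - N/2) / (N/2) = (4 - 2.5) / 2.5 = 0.6
psychometric::CVratio(NTOTAL = 5, NESSENTIAL = 4)
#> [1] 0.6
```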

Measurement, Validity, and Reliability:

  • Measurement is the process of observing and recording events
    • Requires a measurement device and a calibration standard
  • Reliability can be assessed on three prongs
    • Equivalence (inter-rater)
    • Internal consistency (coefficient alpha, split-half)
    • Stability (test-retest)
  • Validity checks whether measurements are as-claimed
    • Content
    • Construct (convergent, discriminant)
    • Criterion (concurrent, predictive)
  • Exploratory Data Analysis can be considered as step 2.5 - between questionnaire administration and initial item reduction
    • c_sat_likert <- c_sat %>% mutate_if(is.integer, as.factor) %>% likert()
    • plot(c_sat_likert)
  • Can use EDA to check for items that are reverse coded (where 5 would be bad and 1 would be good)
    • car::recode(myVar, "1=5; 2=4; 3=3; 4=2; 5=1")

Describing Survey Results:

  • Generally, if there are fewer than 5% missing values and they are evenly distributed, then just omit them
    • Hmisc::naclus(bad_survey)
  • Item correlations can be valuable - items should correlate highly within their own group, but not with items outside it
    • corr.test(myDF)
    • corrplot(cor(myDF), method="circle")

Example code includes:

file001 <- readr::read_csv("./RInputFiles/brandrep-cleansurvey-extraitem.csv")
file002 <- readr::read_csv("./RInputFiles/brandquall11-recodedbutextraitem.csv")
file003 <- readr::read_csv("./RInputFiles/customersatisfactionclean.xls")
file004 <- readr::read_csv("./RInputFiles/brandloyalty.xls")


sme <- data.frame(Rater_A=c(1, 2, 3, 2, 1, 1, 1, 2, 3, 3, 2, 1, 1), 
                  Rater_B=c(1, 2, 2, 3, 3, 1, 1, 1, 2, 3, 3, 3, 1)
                  )


# Print beginning of sme data frame
head(sme)

# Correlation matrix of expert ratings
cor(sme)

# Percentage agreement of experts
irr::agree(sme)


# Check inter-rater reliability
psych::cohen.kappa(sme)

# While our Cohen's kappa and Pearson correlation happen to be similar in value, these are not measuring the same thing
# We are interested in agreement between the pairs of expert ratings on each item rather than a linear relationship between the item ratings in total
# In the next exercise, we'll look at content validity
# This is a measure of the assessment by a panel of experts (not just two, as in Cohen's kappa) about the strength of an individual item


lawshe <- data.frame(item=rep(1:5, each=3), 
                     expert=rep(LETTERS[1:3], times=5), 
                     rating=factor(c("Essential", "Useful", "Not necessary", "Useful", "Not necessary", 
                                     "Useful",  "Not necessary", "Not necessary", "Essential", "Essential", 
                                     "Useful", "Essential", "Essential", "Essential", "Essential"
                                     )
                                   )
                     )


# Calculate the CVR for each unique item in the data frame
lawshe %>% 
    group_by(item) %>% 
    summarize(CVR = psychometric::CVratio(NTOTAL = length(unique(expert)), 
                                          NESSENTIAL = sum(rating == 'Essential')
                                          )
              )


brand_rep <- file001
glimpse(brand_rep)


# Convert items to factor
b_rep_likert <- brand_rep %>% 
    mutate(poor_workman_r=6-poor_workman_r) %>%
    mutate_if(is.double, as.factor) %>%
    as.data.frame()

# Response frequencies - base R
summary(b_rep_likert)

# Plot response frequencies
result <- likert::likert(b_rep_likert)
plot(result)


brand_qual <- file002 %>%
    mutate(tired=6-tired_r) %>%
    select(-innovator) %>%
    as.data.frame()
glimpse(brand_qual)

brand_qual_items <- c('trendy = This brand is trendy.', 
                      'latest = This brand offers the latest products.', 
                      'tired = This is a tired brand.', 
                      "happy_pay = I am happy paying what I do for this brand's products.", 
                      "reason_price = This brand's products are reasonably priced.", 
                      "good_deal = This brand's products are a good deal.", 
                      "strong_perform = This brand's products are strong performers.", 
                      'leader = This brand is a leader in its field.', 
                      'serious = This brand takes its product quality seriously.'
                      )
brand_qual_items


# Get response frequencies from psych
psych::response.frequencies(brand_qual)

# Print item descriptions
brand_qual_items

# Reverse code the "opposite" item
brand_qual$tired_r <- car::recode(brand_qual$tired, "1 = 5; 2 = 4; 4 = 2; 5 = 1")

# Check recoding frequencies
brand_qual %>% 
    select(tired, tired_r) %>%
    psych::response.frequencies() %>%
    round(2)


missing_lots <- file002 %>%
    mutate(tired=6-tired_r) %>%
    select(-tired_r)
set.seed(1908181013)
naRow <- sample(1:nrow(missing_lots), round(0.2*nrow(missing_lots)*ncol(missing_lots)), replace=TRUE)
naCol <- sample(1:ncol(missing_lots), round(0.2*nrow(missing_lots)*ncol(missing_lots)), replace=TRUE)
for (j in seq_along(naRow)) { missing_lots[naRow[j], naCol[j]] <- NA }
glimpse(missing_lots)
    

# Total number of rows
nrow(missing_lots)

# Total number of incomplete cases
sum(!complete.cases(missing_lots))

# Number of incomplete cases by variable
colSums(is.na(missing_lots))

# Hierarchical plot -- what values are missing together?
plot(Hmisc::naclus(missing_lots))


brand_qual_9 <- file002
glimpse(brand_qual_9)


# View significance of item correlations
psych::corr.test(brand_qual_9)

# Visualize item correlations -- corrplot
corrplot::corrplot(cor(brand_qual_9), method = "circle")


b_rep_items <- c("well_made: Crunchola's products are well-made.", 
                 'consistent: Crunchola offers consistently high-quality products.', 
                 'poor_workman: Crunchola suffers from poor workmanship in its products.', 
                 'higher_price: I am willing to pay a higher price for Crunchola products.', 
                 'lot_more: I am willing to pay a lot more for Crunchola products.', 
                 'go_up: The price of Crunchola products would have to go up quite a bit before I would switch to another brand.', 
                 "stands_out: Crunchola's brand really stands out from its competitors.", 
                 "unique: Crunchola's brand is unique from other brands.", 
                 "one_of_a_kind: Crunchola's brand is truly one of a kind."
                 )
b_rep_items


brand_rep_9 <- file001 %>%
    mutate(poor_workman = 6-poor_workman_r) %>%
    select(-poor_workman_r) %>%
    as.data.frame()
glimpse(brand_rep_9)


# Get response frequencies
psych::response.frequencies(brand_rep_9)

# Recode the appropriate item 
brand_rep_9$poor_workman_r <- car::recode(brand_rep_9$poor_workman, "1 = 5; 2 = 4; 4 = 2; 5 = 1")

# Adjust brandrep 9 dataset
brand_rep_9_new <- select(brand_rep_9, -poor_workman)

# Visualize item correlation
corrplot::corrplot(cor(brand_rep_9_new), method = "circle")

Chapter 2 - Exploratory Factor Analysis and Survey Development

Latent Variables:

  • Latent variables are inferred from manifest variables - for example, brand loyalty
  • Parsimony is a general goal of survey development - explain the latent variables with as few factors as the groupings of manifest variables support
    • psych::fa.parallel(myDF) # scree of dataset against random dataset
    • myEFA <- psych::fa(myDF, nfactors=3)
    • myEFA$loadings
    • psych::scree(myDF)

EFA and Item Refinement:

  • Factor loadings are a valuable statistic - relationship between manifest variables and latent variables
    • Ideally, only one latent variable per manifest variable
    • 0 - 0.4 are poor
    • 0.7+ are strong
    • 0.4 - 0.7 are equivocal
  • What makes for a strong EFA?
    • c_sat_11_EFA_3$e.values # check that the number of eigenvalues >1 is the same as the number of factors (rule of thumb)
    • Factor score correlations of 0.6 and under are “not too similar”
    • myEFA$score.cor
  • If results are not favorable, can either 1) drop poorly performing items, or 2) revisit the number of factors

Assessing Internal Reliability:

  • Internal consistency is a measure of survey reliability - consistency within itself
  • Split-half reliability checks whether all parts of the survey contribute equally
    • psych::splitHalf(mySurvey) # generally, 0.8+ indicates internal reliability
  • Coefficient (Cronbach) alpha measures the consistency of measures of the construct
    • psych::alpha(mySurvey) # std.alpha is considered a more reliable metric due to standardization
    • Target is 0.8 - 0.9 with 0.7 - 0.8 also respectable and 0.65 - 0.7 minimally acceptable
    • Values > 0.9 may suggest collinearity problems; drop items
    • Values < 0.65 are undesirable and/or unacceptable; drop items as they may not be measuring the same construct

Example code includes:

b_loyal_10 <- file004
glimpse(b_loyal_10)


# Print correlation matrix
psych::corr.test(b_loyal_10)

# Visualize b_loyal_10 correlation matrix
corrplot::corrplot(cor(b_loyal_10))

# Parallel analysis
psych::fa.parallel(b_loyal_10)


brand_rep_9 <- file001 %>%
    mutate(poor_workman = 6-poor_workman_r) %>%
    select(-poor_workman) %>%
    as.data.frame()
glimpse(brand_rep_9)


# Scree plot
psych::scree(brand_rep_9)

# Conduct three-factor EFA
brand_rep_9_EFA <- psych::fa(brand_rep_9, nfactors = 3)

# Print output of EFA
names(brand_rep_9_EFA)


# Summarize results of three-factor EFA
summary(brand_rep_9_EFA)

# Build and print loadings for a two-factor EFA
brand_rep_9_EFA_2 <- psych::fa(brand_rep_9, nfactors = 2)
brand_rep_9_EFA_2$loadings

# Build and print loadings for a four-factor EFA
brand_rep_9_EFA_4 <- psych::fa(brand_rep_9, nfactors = 4)
brand_rep_9_EFA_4$loadings


# (Factor loadings greater than 1, while rare, are not necessarily an issue.)

# Three factor EFA - brand_rep_9
brand_rep_9_EFA_3 <- psych::fa(brand_rep_9, nfactors = 3)

# Eigenvalues
brand_rep_9_EFA_3$e.values

# Factor score correlations
brand_rep_9_EFA_3$score.cor

# Factor loadings
brand_rep_9_EFA_3$loadings


# Create brand_rep_8 data frame
brand_rep_8 <- brand_rep_9 %>% select(-one_of_a_kind)

# Create three-factor EFA
brand_rep_8_EFA_3 <- psych::fa(brand_rep_8, nfactors=3)

# Factor loadings
brand_rep_8_EFA_3$loadings

# Factor correlations -- 9 versus 8 item model
brand_rep_8_EFA_3$score.cor
brand_rep_9_EFA_3$score.cor


# Three factor EFA loadings
brand_rep_8_EFA_3$loadings

# Two factor EFA & loadings
brand_rep_8_EFA_2 <- psych::fa(brand_rep_8, nfactors = 2)
brand_rep_8_EFA_2$loadings

# Four factor EFA & loadings
brand_rep_8_EFA_4 <- psych::fa(brand_rep_8, nfactors = 4)
brand_rep_8_EFA_4$loadings

# Scree plot of brand_rep_8
psych::scree(brand_rep_8)


# Standardized coefficient alpha
psych::alpha(brand_rep_9)$total$std.alpha

# 3-factor EFA
brand_rep_9_EFA_3 <- psych::fa(brand_rep_9, nfactors = 3)
brand_rep_9_EFA_3$loadings

# Standardized coefficient alpha - refined scale
psych::alpha(brand_rep_8)$total$std.alpha

# A survey with poorly-loading items can still be reliable - that's why we do EFA first
# Remember that a reliable survey in itself is not the goal of measurement - it is necessary but not sufficient


# Get names of survey items
names(brand_rep_8)

# Create new data frames for each of three dimensions
p_quality <- brand_rep_8 %>% select(1:3)
p_willingness <- brand_rep_8 %>% select(4:6)
# p_difference <- brand_rep_8 %>% select(7:8)

# Check the standardized alpha for each dimension
psych::alpha(p_quality)$total$std.alpha
psych::alpha(p_willingness)$total$std.alpha
# psych::alpha(p_difference)$total$std.alpha
psych::alpha(brand_rep_8)$total$std.alpha


# Get split-half reliability 
psych::splitHalf(brand_rep_8)

# Get averages of even and odd row scores
odd_scores <- rowMeans(brand_rep_8[c(TRUE, FALSE), ])
even_scores <- rowMeans(brand_rep_8[c(FALSE,TRUE), ])

# Correlate scores from even and odd items
cor(odd_scores[1:length(even_scores)], even_scores)


# 3 factor EFA
b_loyal_10_EFA_3 <- psych::fa(b_loyal_10, nfactors = 3)

# Factor loadings, eigenvalues and factor score correlations
b_loyal_10_EFA_3$loadings
b_loyal_10_EFA_3$e.values
b_loyal_10_EFA_3$score.cor

# 2 factor EFA
b_loyal_10_EFA_2 <- psych::fa(b_loyal_10, nfactors = 2)

# Factor loadings, eigenvalues and factor score correlations
b_loyal_10_EFA_2$loadings
b_loyal_10_EFA_2$e.values
b_loyal_10_EFA_2$score.cor

Chapter 3 - Confirmatory Factor Analysis and Construct Validation

CFA and EFA:

  • Confirmatory Factor Analysis (CFA) is a means of construct validation
    • Does the number of factors reflected in the data match theory and hypotheses?
  • Can use the lavaan package for latent variable analysis
    • la(tent) va(riable) an(alysis)
    • Use =~ to assign items to factors
  • Example for using lavaan on the 9-item survey
    • bq_9_CFAModel <- 'VAL =~ reason_price + happy_pay + good_deal ; PERF =~ serious + leader + strong_perform ; FUN =~ trendy + latest + tired_r' # semicolons separate factor definitions in a single-line model string
    • bq_9_CFA <- cfa(model = bq_9_CFAModel, data=bq_9)
    • summary(bq_9_CFA, fit.measures=TRUE, standardized=TRUE)
    • inspect(bq_9_CFA, "std")$lambda
    • semPlot::semPaths(bq_9_CFA)

CFA Assumptions and Interpretation:

  • Can test for multivariate normality - p-values for skewness and kurtosis
    • psych::mardia(myData)
  • The default for lavaan is maximum likelihood, which assumes normality
    • Can instead use MLR to mitigate non-normality
    • bq_cfa <- cfa(model=myModel, data=myData, estimator="MLR")
  • Can look at fit measures and assess model performance
    • CFI (comparative fit index) - should be 0.9+
    • TLI (Tucker Lewis Index) - should be 0.9+
    • Chi-squared p-value - ideally non-significant (> 0.05), though it will often be significant for large sample sizes, even with a good model
    • RMSEA - ideally less than 0.05
  • Can use the fitMeasures() function to get 42 fit measures
    • fitMeasures(myModel)
    • fitMeasures(myModel, fit.measures=c("cfi", "tli"))
  • Can inspect estimates using standardizedSolution()
    • standardizedSolution(myModel)

Construct Validity:

  • Construct validity is the extent to which the actual measurements and the claims of what is being measured are congruent
    • Validity is like being centered on a bullseye
    • Reliability is based on being tightly clustered around the mean
  • If two dimensions are measuring the same thing, then they should be combined in the interests of parsimony
    • semTools::reliability(myModel)
  • Discriminant validity means that items should be distinct, but not unrelated
    • avevar should be 0.5+
    • CR (omega) should be 0.7+
    • alpha (Cronbach) should be 0.7+

Example code includes:

brand_rep_EFA <- brand_rep_8_EFA_3
brand_rep_8_model <- 'F1 =~ well_made + consistent + poor_workman_r
F2 =~ higher_price + lot_more + go_up
F3 =~ stands_out + unique'
brand_rep_CFA <- lavaan::cfa(model=brand_rep_8_model, data=brand_rep_8)


# Factor loadings -- EFA
brand_rep_EFA$loadings

# Factor loadings -- CFA
lavaan::inspect(brand_rep_CFA, what = "std")$lambda

# Plot diagram -- EFA
psych::fa.diagram(brand_rep_EFA)

# Plot diagram -- CFA
semPlot::semPaths(brand_rep_CFA)


library(lavaan)

# Rename items based on proposed dimensions
colnames(b_loyal_10) <- c("ID1", "ID2", "ID3", "PV1", "PV2", "PV3", "BT1", "BT2", "BT3", "BT4")

# Define the model
b_loyal_cfa_model <- 'ID =~ ID1 + ID2 + ID3
                    PV =~ PV1 + PV2 + PV3
                    BT =~ BT1 + BT2 + BT3 + BT4'
                        
# Fit the model to the data
b_loyal_cfa <- lavaan::cfa(model=b_loyal_cfa_model, data=b_loyal_10)

# Check the summary statistics -- include fit measures and standardized estimates
summary(b_loyal_cfa, fit.measures=TRUE, standardized=TRUE)


# Two dimensions: odd- versus even-numbered items
bad_model <- 'ODD =~ CS1 + CS3 + CS5 + CS7 + CS9
              EVEN =~ CS2 + CS4 + CS6 + CS8 + CS10'
                
# Fit the model to the data
c_sat_bad_CFA <- cfa(model=bad_model, data=file003)

# Summary measures
summary(c_sat_bad_CFA, fit.measures=TRUE, standardized=TRUE)


c_sat_model <- 'F1 =~ CS1 + CS2 + CS3 + CS4
F2 =~ CS5 + CS6 + CS7
F3 =~ CS8 + CS9 + CS10'
c_sat_50 <- file003[1:50, ]


# Mardia's test for multivariate normality
psych::mardia(c_sat_50)

# Fit model to the data using robust standard errors
c_sat_cfa_mlr <- cfa(model=c_sat_model, data=c_sat_50, estimator="MLR")

# Summary including standardized estimates and fit measures
summary(c_sat_cfa_mlr, fit.measures=TRUE, standardized=TRUE)


c_sat_model_a <- 'F1 =~ CS1 + CS2 + CS3 + CS4
F2 =~ CS5 + CS6 + CS7
F3 =~ CS8 + CS9 + CS10'
c_sat_model_b <- 'F1 =~ CS1 + CS3 + CS5 + CS7 + CS9
F2 =~ CS2 + CS4 + CS6 + CS8 + CS10'


# Fit the models to the data
c_sat_cfa_a <- cfa(model = c_sat_model_a, data = file003)
c_sat_cfa_b <- cfa(model = c_sat_model_b, data = file003)

# Print the model definitions
cat(c_sat_model_a)
cat(c_sat_model_b)

# Calculate the desired model fit statistics
fitMeasures(c_sat_cfa_a, fit.measures=c("cfi", "tli"))
fitMeasures(c_sat_cfa_b, fit.measures=c("cfi", "tli"))


c_sat <- file003
names(c_sat) <- c("CSU1", "CSU2", "CSU3", "CSU4", "EU1", "EU2", "EU3", "PS1", "PS2", "PS3")


# Define Model A -- the baseline three-factor model
c_sat_model_a <- 'CSU =~ CSU1 + CSU2 + CSU3 + CSU4
                EU =~ EU1 + EU2 + EU3
                PS =~ PS1 + PS2 + PS3'

# View current c_sat model
cat(c_sat_model_a)

# Add EU1 to the CSU factor
c_sat_model_b <- 'CSU =~ CSU1 + CSU2 + CSU3 + CSU4 + EU1
                EU =~ EU1 + EU2 + EU3
                PS =~ PS1 + PS2 + PS3'

# Fit Models A and B to the data
c_sat_cfa_a <- cfa(model = c_sat_model_a, data = c_sat)
c_sat_cfa_b <- cfa(model = c_sat_model_b, data = c_sat)

# Compare the nested models
anova(c_sat_cfa_a, c_sat_cfa_b)


# Fit the model to the data 
# c_sat_cfa <- cfa(model = c_sat_model, data = c_sat_group, group = "COUNTRY")

# Summarize results -- include fit measures and standardized estimates
# summary(c_sat_cfa, fit.measures=TRUE, standardized=TRUE)

# Get average estimate for both groups
# standardized_solution <- standardizedSolution(c_sat_cfa)
# standardized_solution %>%
#   filter(op == "=~") %>%
#   group_by(group) %>% 
#   summarize(mean(est.std))


c_sat_cfa_model_3 <- 'F1 =~ CS1 + CS2 + CS3 + CS4
F2 =~ CS5 + CS6 + CS7
F3 =~ CS8 + CS9 + CS10'
c_sat_cfa_model_2 <- 'F1 =~ CS1 + CS2 + CS3 + CS4 + CS5 + CS6 + CS7
F2 =~ CS8 + CS9 + CS10'

# Fit three-factor CFA
c_sat_cfa_3 <- cfa(model = c_sat_cfa_model_3, data = file003)

# Inspect key fit measures - three-factor CFA
fitMeasures(c_sat_cfa_3, fit.measures = c("cfi","tli","rmsea"))

# Fit two-factor CFA
c_sat_cfa_2 <- cfa(model = c_sat_cfa_model_2, data = file003)

# Inspect key fit measures - two-factor CFA
fitMeasures(c_sat_cfa_2, fit.measures = c("cfi","tli","rmsea"))

# Compare measures of construct validity for three- versus two-factor models
semTools::reliability(c_sat_cfa_3)
semTools::reliability(c_sat_cfa_2)


brand_rep_CFA_model <- 'F1 =~ well_made + consistent + poor_workman_r
F2 =~ higher_price + lot_more + go_up
F3 =~ stands_out + unique'
brand_rep_CFA <- lavaan::cfa(model=brand_rep_CFA_model, data=brand_rep_8)


# Print CFA model
cat(brand_rep_CFA_model)

# semTools reliability measures
semTools::reliability(brand_rep_CFA)

# psych standardized coefficient alpha measure
psych::alpha(brand_rep_9)$total$std.alpha


# Store F1 estimates as object loadings
loadings <- standardizedSolution(c_sat_cfa_3) %>%
    filter(op == "=~", lhs == "F1") %>% 
    select(est.std)

# Composite reliability
re <- 1 - loadings ^ 2
result <- sum(loadings) ^ 2 / ((sum(loadings)^ 2)  + sum(re))
result

# Average variance extracted
l2 <- loadings ^ 2
avg_var <- sum(l2) / nrow(loadings)
avg_var

# Compare versus semTools
semTools::reliability(c_sat_cfa_3)


# Print brand_rep_factors
# brand_rep_factors

# Build model for lavaan
brand_rep_8_cfa_model <- "QUAL =~ consistent + well_made + poor_workman_r
PRICE =~ go_up + lot_more + higher_price
UNIQUE =~ stands_out + unique"

# Summarize results with fit measures and standardized estimates
# summary(brand_rep_8_CFA, standardized = TRUE, fit.measures = TRUE)

# Construct validity
# semTools::reliability(brand_rep_8_CFA)

Chapter 4 - Criterion Validity and Replication

Concurrent Validity and Model Diagrams:

  • Criterion validity is a measure of the relationship between the construct and an external variable of interest
  • Variables are not always on the 1-5 scale, and differences in units can negatively impact model validity
    • describe(myData) # check means and standard deviations (ideally, everything is N(0, 1))
  • Can latentize a variable by adding it with =~
    • myModel <- '... age_fact =~ age' # latentizes age to age_fact
  • Can correlate manifest and latent variables with ~~
    • myModel <- '... age_fact =~ age ; age_fact ~~ F1 + F2 + F3' # latentizes age to age_fact and correlates it with the latent factors (F1, F2, F3 already defined in the model)
    • mySEM <- sem(myModel, data=myData, estimator="MLR")
    • summary(mySEM, fit.measures=TRUE, standardized=TRUE)
  • Diagrams are sometimes called “spaghetti and meatballs” plots, a nod to how busy they can become

Predictive Validity and Factor Scores:

  • Predictive validity assesses the degree to which models predict future outcomes
  • Linear regression can be used for this task
    • Begin by binding and scaling all of the relevant variables
  • Can run regression in lavaan using ~
    • c_sat_model <- '... spend ~ F1 + F2 + F3' # assumes F1, F2, F3 each defined as usual using =~
    • semPaths(c_sat_sem, rotation=2)
    • standardizedSolution(c_sat_sem) %>% filter(op == "~") %>% mutate_if(is.numeric, round, digits=3)
    • inspect(c_sat_sem, "r2") # pull the R-squared
  • Factor scores are numerical scores reflecting relative standings on the latent factor
    • csat_cfa <- cfa(model = csat_model, data = c_sat)
    • csat_scores <- as.data.frame(predict(csat_cfa))
    • describe(csat_scores)
    • multi.hist(csat_scores)

Repeated Measures, Replication, and Factor Scores:

  • Stability is a third form of reliability measurement
    • Does an instrument get similar responses if measuring the same population near the same time?
    • “Test-retest reliability”
    • survey_test_retest <- psych::testRetest(t1 = survey_t_1, t2 = survey_t_2, id = "id")
    • Generally, scores of 0.9+ are very good and scores below 0.7 are unreliable
  • Replication is a different step that can be taken in the event that it is not possible to get people to retake the survey
    • Split the data by rows - odd vs. even
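
The test-retest step above can be sketched with psych::testRetest(); survey_t_1 and survey_t_2 are hypothetical data frames holding the same items administered twice to the same respondents, each carrying an id column used to match rows across waves:

```r
# Hedged sketch of test-retest reliability (stability), assuming hypothetical
# data frames survey_t_1 and survey_t_2: same items, same respondents, two
# administrations, each with an "id" column matching respondents across waves
library(psych)

survey_test_retest <- testRetest(t1 = survey_t_1, t2 = survey_t_2, id = "id")

# r12 is the correlation of scale scores across the two administrations;
# 0.9+ is very good, below 0.7 is unreliable
survey_test_retest$r12
```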

Wrap Up:

  • Six step process for building and testing models

Example code includes:

spendData <- c(94.5, 715, 145.5, 772.5, 133.5, 350, 75.5, 304.5, 117, 81, 234.5, 102, 152.5, 295, 145, 222, 121.5, 142, 82.5, 144, 130, 141, 545, 142.5, 175, 154, 130, 148.5, 255, 139.5, 420, 373.5, 197.5, 487.5, 337.5, 133.5, 114, 84, 255.5, 129, 114, 275, 297, 84, 87, 109.5, 123, 405, 123, 158, 145, 139.5, 112.5, 458, 138, 91.5, 190, 257.5, 155, 259, 120, 84, 84, 755, 84, 412, 270, 134, 285, 227.5, 133.5, 123, 127.5, 825, 418, 103.5, 144, 124, 120, 445.5, 150.5, 75, 129, 312, 330, 182.5, 282, 91.5, 218, 245, 157.5, 118.5, 148.5, 505, 87, 182, 111, 294, 110, 325.5, 115.5, 312, 120, 510, 91.5, 139.5, 85.5, 189, 152, 141, 138, 387, 114, 84, 213, 120, 115.5, 231, 78, 85.5, 354, 142.5, 128, 212, 547.5, 145, 103.5, 294, 354, 182.5, 185, 212, 97.5, 103.5, 235, 395, 105.5, 283.5, 155, 91.5, 94.5, 297.5, 283.5, 125, 159, 139.5, 95, 198, 104, 138, 155, 200, 97.5, 224, 588, 108, 100.5, 183, 350, 153, 150, 155, 91.5, 138, 117, 135.5, 138, 202, 257, 103.5, 114, 282, 112, 198, 159, 420, 315, 402, 507, 259.5, 81, 127.5, 144, 225, 141, 84, 150, 150.5, 455, 75, 294, 102, 199.5, 385, 155, 144, 135, 142.5, 172, 390, 94.5, 153, 472.5, 105, 123, 188, 325, 504, 99, 111, 151.5, 78, 545, 170, 123, 93, 381.5, 735, 100.5, 97.5, 155, 252.5, 192, 132, 252.5, 121.5, 90, 257, 151.5, 94.5, 153.5, 311.5, 79.5, 284, 151.5, 95, 78, 480, 102, 215, 115.5, 330, 592.5, 79.5, 355.5, 195, 105.5, 142.5, 154, 155, 312, 321, 75.5, 185, 324, 155, 530, 127.5, 148.5, 152, 111, 157.5, 151.5, 772.5, 123, 115.5, 145.5, 84, 515, 82.5, 108, 130.5, 138, 279, 151.5, 207, 150, 109.5, 150, 153.5, 152.5, 150.5, 88.5, 185, 115.5, 123, 150, 114, 321, 144, 142.5, 152, 82.5, 187.5, 97.5, 145, 257, 435, 250, 310.5, 78, 105.5, 102, 138, 303, 285, 155, 124.5, 240, 204, 118.5, 241.5, 147, 118.5, 105, 591.5, 180, 93, 252, 103.5, 287, 575, 75, 238, 189, 204, 210, 153, 145.5, 117, 559, 153.5, 79.5, 222.5, 145.5, 88.5, 159, 155, 255, 127.5, 300, 154, 213, 135, 84, 151.5, 127.5, 99, 200, 135, 522.5, 297, 152, 127.5, 203, 103.5, 
178.5, 130.5, 255, 100.5, 213, 185, 228, 115.5, 109.5, 75.5, 273, 511, 414, 152, 217, 150, 102, 537.5, 282, 440, 288, 172.5, 112.5, 577.5, 140, 291.5, 152, 582.5, 210, 318.5, 185, 145.5, 148.5, 324, 145, 105.5, 132, 85.5, 135, 152, 135, 324, 200, 155, 247.5, 197.5, 95, 304.5, 215, 577.5, 111, 495, 141, 139.5, 112.5, 110, 135.5, 97.5, 157.5, 243, 159, 155, 185, 155, 114, 395, 130.5, 238, 345.5, 597.5, 210, 220, 210, 222.5, 124.5, 158, 150, 490.5, 270, 88.5, 205, 135, 90, 152, 309, 153, 105, 111, 78, 123, 159, 95, 115, 435, 235, 292.5, 155, 304.5, 114, 104, 135, 397.5, 93, 257, 102, 204, 252, 152, 215, 108, 148.5, 79.5, 155, 114, 94.5, 118.5, 178.5, 111, 150.5, 195, 85.5, 84, 93, 575, 148.5, 757.5, 155, 87, 112.5, 88.5, 255, 358, 84, 405, 153, 127.5, 81, 135.5, 154.5, 247.5, 182, 79.5, 373.5, 95, 147, 145.5, 152.5, 294, 259.5, 354, 103.5, 187.5, 124.5, 218, 227.5, 481.5, 125, 123, 450, 129, 318, 170, 319.5, 91.5, 183, 154, 391.5, 458, 303, 114, 111, 112.5, 222, 742.5, 234, 120, 81, 312, 335, 135, 133.5, 118.5, 390, 518, 215, 373.5, 118.5, 195, 111, 205, 94.5, 123, 99, 75.5, 102, 244, 380, 357.5, 254, 227.5, 198, 192.5, 151.5)
brand_rep_spend <- tibble::tibble(spend=spendData)
brand_rep <- file001 %>%
    mutate(poorworkman = 6-poor_workman_r) %>%
    select(-poorworkman)
brand_rep
brand_rep_spend


# Check if brand_rep and brand_rep_spend have the same number of rows
same_rows <- nrow(brand_rep) == nrow(brand_rep_spend)
same_rows

# Append spend column to brand_rep
brand_rep <- cbind(brand_rep, brand_rep_spend)

# Scale the data
brand_rep_scaled <- scale(brand_rep)

# Get summary statistics of scaled dataframe
psych::describe(brand_rep)
psych::describe(brand_rep_scaled)


# Correlate F1, F2 and F3 to spend_f, the 'latentized' spend
brand_rep_model <- 'F1 =~ well_made + consistent + poor_workman_r
F2 =~ higher_price + lot_more + go_up
F3 =~ stands_out + unique
spend_f =~ spend
spend_f ~~ F1 + F2 + F3'

# Fit the model to the data -- sem()
brand_rep_cv <- lavaan::sem(data = brand_rep_scaled, model = brand_rep_model)

# Print the standardized covariances b/w spend_f and other factors
lavaan::standardizedSolution(brand_rep_cv) %>% 
    filter(rhs == "spend_f")

# Plot the model with standardized estimate labels
semPlot::semPaths(brand_rep_cv, whatLabels = "est.std", edge.label.cex = .8)


c_sat <- file003
c_sat_recommend <- tibble::tibble(Rec_1=c(4, 3, 3, 3, 4, 3, 2, 3, 3, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 3, 3, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 2, 3, 3, 4, 4, 3, 4, 3, 3, 4, 3, 3, 3, 1, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 2, 4, 4, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 4, 2, 4, 3, 4, 3, 3, 4, 3, 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 3, 2, 4, 3, 4, 2, 3, 4, 3, 3, 3, 4, 4, 4, 3, 3, 4, 3, 3, 4, 4, 4, 3, 2, 3, 3, 4, 2, 5, 3, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 2, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 2, 4, 4, 3, 3, 3, 4, 3, 3, 4, 4, 3, 3, 5, 4, 4, 3, 4, 2, 4, 4, 4, 4, 3, 4, 3, 3, 3, 2, 3, 3, 3, 4, 3, 3, 4, 5, 4, 4, 3, 4, 4, 3, 3, 4, 4, 3, 4, 3, 4, 4, 4, 1, 4, 3, 4, 3, 3, 3, 4, 3, 5, 4, 3, 3, 4, 3, 4, 3, 3, 3, 3, 4, 2, 3, 4, 3, 4, 3, 4, 4, 4, 3, 3, 3, 5, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 4, 2, 4, 3, 2, 3, 3, 5, 4, 2, 5, 3, 5, 3, 2, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 5, 4, 4, 3, 3, 5, 3, 4, 3, 5, 4, 3, 2, 2))
c_sat
c_sat_recommend


# Bind & scale the variables
c_sat_rec_scale <- c_sat %>% 
    bind_cols(c_sat_recommend) %>% scale()

# Define the model - Rec_f covaries with F1, F2, F3
c_sat_rec_model <- 'F1 =~ CS1 + CS2 + CS3 + CS4
F2 =~ CS5 + CS6 + CS7
F3 =~ CS8 + CS9 + CS10
Rec_f =~ Rec_1
Rec_f ~~ F1 + F2 + F3'

# Fit the model to the data 
c_sat_rec_sem <- lavaan::sem(model = c_sat_rec_model, data = c_sat_rec_scale)

# Look up standardized covariances
lavaan::standardizedSolution(c_sat_rec_sem) %>% 
    filter(rhs == "Rec_f")


# Define the model
b_q_model <- 'HIP =~ trendy + latest + tired_r
            VALUE =~ happy_pay + reason_price + good_deal
            PERFORM =~ strong_perform + leader + serious
            spend ~ HIP + VALUE + PERFORM'

# Fit the model to the data
# b_q_pv <- lavaan::sem(data = b_q_scale, model = b_q_model)

# Check fit, r-square, standardized estimates
# summary(b_q_pv, standardized = TRUE, fit.measures = TRUE, rsquare = TRUE)

# Plot the model -- rotate from left to right
# semPlot::semPaths(b_q_pv, rotation = 2, whatLabels = "est.std", edge.label.cex = 0.8)


# Plot the new model
# semPlot::semPaths(brand_rep_sem, rotation = 2)

# Get the coefficient information
# lavaan::standardizedSolution(brand_rep_sem) %>% filter(op == "~")

# Get the r-squared
# r_squared <- lavaan::inspect(brand_rep_sem, "r2")["F2"]
# r_squared


# Compute factor scores in lavaan -- store as data frame
brand_rep_scores <- as.data.frame(predict(brand_rep_CFA))

# Summary statistics of our factor scores
psych::describe(brand_rep_scores)

# Plot histograms for each variable
psych::multi.hist(brand_rep_scores)

# Are they normally distributed? Check using map()
map(brand_rep_scores, shapiro.test)


# Linear regression of standardized spending and factor scores
# bq_fs_reg <- lm(spend ~ F1 + F2 + F3, data = bq_fs_spend)

# Summarize results, round estimates
# rounded_summary <- round(coef(bq_fs_reg), 3)
# rounded_summary

# Summarize the results of CFA model
# summary(brand_qual_pv)

# Compare the r-squared of each
# inspect_rsq <- lavaan::inspect(brand_qual_pv, "r2")["spend"]
# inspect_rsq
# summary(bq_fs_reg)$r.squared


# Descriptive statistics grouped by 'time'
# psych::describeBy(brand_rep_t1_t2, "time")

# Test retest: time == 1 versus time == 2 by id = "id"
# brand_rep_test_retest <- psych::testRetest(t1 = filter(brand_rep_t1_t2, time == 1), t2 = filter(brand_rep_t1_t2, time == 2), id = "id")

# brand_rep_test_retest$r12


brand_rep <- file001 %>%
    select(-one_of_a_kind)
brand_rep


# Split data into odd and even halves
brand_rep_efa_data <- as.data.frame(brand_rep)[c(TRUE,FALSE),]
brand_rep_cfa_data <- as.data.frame(brand_rep)[c(FALSE,TRUE),]

# Get factor loadings of brand_rep_efa_data EFA
efa <- psych::fa(brand_rep_efa_data, nfactors = 3)
efa$loadings

# Confirm the data that the model was fit to
# lavaan::inspect(brand_rep_cfa, what = "call")

# Check fit measures
# lavaan::fitmeasures(brand_rep_cfa)[c("cfi", "tli", "rmsea")]

Data Science for Managers

Chapter 1 - Introduction to Data Science

What is Data Science?

  • Data science is a set of methodologies for drawing meaningful conclusions from available data
    • Describe the current state (reporting, etc.)
    • Detect anomalous events (fraud detection, defects, etc.)
    • Diagnosis of causes of events and behaviors
    • Predict future events and outcomes
  • Data science workflows typically include
    • Data collection
    • Exploration and visualization
    • Experimentation and prediction

Applications of Data Science:

  • Example of a fraud detection process - likelihood that a transaction is fake
    • Need many examples and a label describing each transaction as fraudulent or valid
  • Begin with a well-defined question, such as “what is the probability that a specific transaction is fraudulent?”
  • Example of building a smart watch to auto-detect activities - accelerometer
  • The Internet of Things (IoT) is often combined with Data Science
    • Smart watches
    • Electronic tool collection
    • Etc.
  • Image recognition (humans) for self-driving cars
    • Deep learning uses multiple layers of neurons (mini-layers) to draw conclusions
    • Deep learning needs large volumes of data and draws conclusions that would not be possible with many other types of models

Building a Data Science Team:

  • Data engineers, data analysts, and machine learning scientists
    • Data engineers build the infrastructure - SQL, Java, Scala, Python
    • Data analysts describe the present using data (dashboards, hypothesis testing) - SQL, Excel, Tableau/BI Tools
    • Machine learning scientists extrapolate from what is already known (test-train, prediction, image classification, etc.) - Python, R
  • Programming languages are easier to learn than spoken languages - you can often read a language even if you cannot write it
  • Three main ways to structure teams - isolated, embedded, hybrid
    • Isolated - standalone (good for training new team members and rotating projects)
    • Embedded - part of squads of engineers and managers (good for gaining SME on a business area)
    • Hybrid - Embedded, but with a layer that is also specific to the data science track

Chapter 2 - Data Sources and Risks

Data Sources and Risks:

  • Many sources of data - web, transactions, logistics, customers, etc.
  • Web data can be useful - user, web page, element, time stamp, etc.
  • Personally Identifiable Information (PII) is a name, e-mail, address, or anything else that could be tied back to a real human
    • PII should be treated very carefully, perhaps by tracking a userID rather than an e-mail address
    • Data is “pseudonymized” if the PII could only be recovered by joining multiple tables
    • Destroying or deleting the user’s data would then fully anonymize the data
  • The GDPR is the “General Data Protection Regulation” inside the EU, and gives individuals access to their personal data
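The pseudonymization point above can be sketched with a hypothetical join (all table and column values below are fabricated for illustration):

```r
library(dplyr)

# Pseudonymized web events: this table alone contains no PII
events <- tibble::tibble(user_id = c(101, 102), page = c("/home", "/cart"))

# A separate lookup table still maps the pseudonym back to PII
lookup <- tibble::tibble(user_id = c(101, 102),
                         email   = c("a@example.com", "b@example.com"))

# Joining the two tables re-identifies the users...
joined <- inner_join(events, lookup, by = "user_id")
joined

# ...so destroying the lookup table would fully anonymize the events
```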

Solicited Data:

  • Solicited data can be used to create marketing collateral - e.g., surveys, reviews, questionnaires, focus groups, etc.
  • The NPS (Net Promoter Score) is a common form of solicited data
  • Solicited data can be either quantitative or qualitative
    • Qualitative data (small-scale) can be helpful for generating hypotheses
    • Quantitative data (larger-scale) can be helpful for testing hypotheses and making decisions
  • Revealed and stated preferences can be different
  • Best practices are to be as specific as possible when asking questions
    • Avoid loaded language - be as objective as possible
    • Calibrate by comparing to competitor or company offerings
    • Limit the number of questions to those that will help with taking a decisive action

Collecting Additional Data:

  • Data from external sources can include API, Public Records, Mechanical Turk
  • An API (“Application Programming Interface”) can be a good way to request/collect data from the internet
    • In the US, data.gov can be a valuable source
    • In the EU, data.europa.eu can be a valuable source
  • A good image classification process requires a large training set with labels
    • Mechanical Turk (MTurk) can be very helpful - ask humans to do some of the labelling, similar to a CAPTCHA (though could be a paid process)
    • AWS Mturk is one resource for organizing a Mechanical Turk process

Data Storage and Retrieval:

  • Cloud storage is paying a third party to house data - AWS, Google Cloud, etc.
  • Data can be unstructured or structured - often stored in a document database
    • Data querying can be done in many ways - by date, by user, etc.
    • Each type of database has its own language - NoSQL (not only SQL), SQL, etc.
  • Building a corporate database depends on picking a location, deciding on the language/design (document vs. relational), and a system for referencing and retrieving data

Chapter 3 - Analysis and Visualization

Dashboards:

  • A dashboard is a set of metrics that updates on a schedule - real-time, daily, weekly, etc.
    • Can track distributions over time, distributions by category, time series trends, etc.
  • Display text may be added to dashboards for a splash of qualitative content
  • Dashboards can be built with spreadsheet tools or BI tools

Ad hoc analysis:

  • An ad hoc request is a request for data or analysis that does not need to be repeated on a regular basis
  • Example of a one-time request to understand the impact of an advertising campaign
    • Requires specificity, context, priority, importance, etc.
    • Ticket systems are sometimes used to manage ad hoc requests

A/B Testing:

  • A/B Testing is a type of experiment for de-risking potential changes to the business
    • Example of randomly dividing the audience into two groups and then showing each a different web page title
    • Pick Metric -> Determine Sample Size -> Conduct Experiment -> Run Assessment
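The “Determine Sample Size” step above can be sketched with base R’s power.prop.test(); the 10% and 12% conversion rates are hypothetical:

```r
# How many visitors per group are needed to detect a lift from a 10% to a
# 12% conversion rate, with 80% power at a 5% significance level?
ab_power <- power.prop.test(p1 = 0.10, p2 = 0.12,
                            power = 0.80, sig.level = 0.05)

# Round up to a whole number of visitors per group
n_per_group <- ceiling(ab_power$n)
n_per_group
```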

Chapter 4 - Prediction

Supervised Machine Learning:

  • Machine learning is the process of making predictions from data
    • Supervised machine learning has both features and labels
  • Example of a subscription business and a goal of determining whether the customer will churn
    • Historical customer data available, with labels as to whether they have churned
  • Can then evaluate the model based on a hold-out sample (test-train methodology)
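A minimal sketch of the test-train methodology, using simulated churn data (every variable name and parameter here is hypothetical):

```r
set.seed(42)

# Simulated subscription customers: longer tenure -> less likely to churn
customers <- data.frame(tenure = rpois(500, 24),
                        spend  = rnorm(500, 50, 10))
customers$churn <- rbinom(500, 1, plogis(1 - 0.1 * customers$tenure))

# Split into training (70%) and hold-out test (30%) sets
train_idx <- sample(nrow(customers), 0.7 * nrow(customers))
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]

# Fit a logistic regression on the training set only
fit <- glm(churn ~ tenure + spend, data = train, family = binomial)

# Evaluate on the hold-out sample the model has never seen
pred <- predict(fit, newdata = test, type = "response") > 0.5
acc  <- mean(pred == test$churn)
acc
```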

Clustering:

  • Clustering can help with pattern recognition in messy datasets
  • Example of working for an airline, and trying to segment the customers
    • Number of flights, % international, % business class, mean days in advance booked
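The airline segmentation idea can be sketched with kmeans() on simulated features (the two customer groups below are fabricated for illustration):

```r
set.seed(1)

# Two hypothetical flyer segments: occasional leisure vs. frequent business
flyers <- data.frame(
    n_flights    = c(rpois(50, 4),      rpois(50, 30)),
    pct_business = c(runif(50, 0, 0.2), runif(50, 0.5, 1.0))
)

# Scale the features first so neither dominates the distance metric
km <- kmeans(scale(flyers), centers = 2, nstart = 25)

# Cluster sizes and cluster centers (in standardized units)
table(km$cluster)
km$centers
```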

Special Topics in Machine Learning:

  • Seasonality can be weekly, monthly, yearly, summer, etc.
  • NLP (Natural Language Processing) is an example of a machine learning topic
    • Features can be word counts
    • Word count challenges can include negation and synonyms
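A tiny word-count example showing the negation problem mentioned above; the two made-up reviews contain exactly the same words, so a bag-of-words representation cannot tell them apart:

```r
reviews <- c("great flight not delayed",
             "delayed flight not great")

# Bag-of-words features: count each word within each review
word_counts <- lapply(strsplit(reviews, " "), table)
word_counts

# Identical counts for opposite sentiments -- the negation context is lost
identical(as.integer(word_counts[[1]]), as.integer(word_counts[[2]]))
```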

Deep Learning and Explainable AI:

  • Deep learning (neural networks) is a special type of machine learning that can solve more complex problems and that requires large quantities of input data
    • Generally lacks explainability, though with good predictive power
    • Explainable AI (artificial intelligence) would have factors that can be explained
  • Explainable AI will include both predictions AND explanations (factors that drive the predictions)

R for SAS Users

Chapter 1 - Getting Started with R

Get Help and Load Data in R:

  • R Functionality includes packages and global environment
    • ls() is similar to PROC DATASETS
    • load() is similar to LIBNAME or DATA + SET
  • Generally, an R session begins without any objects loaded in memory
  • Abalone dataset is available from UC Irvine
    • load("abalone.RData")
    • library(myPkg) will load myPkg
    • sessionInfo() shows the currently loaded packages

Dataset Contents and Descriptive Statistics:

  • Can use readr::read_csv() to load a CSV file into the environment
    • myObject <- readr::read_csv("myFile.csv")
    • str(myObject) # look at the variables and dimensions of the dataset
    • dim(myObject) # dimensions of the object
    • names(myObject) # variable names
    • head(myObject) # first 6 rows
    • tail(myObject) # last 6 rows
  • The dplyr functions are helpful for manipulating data frames
    • arrange(myDF, myCol) # sort by ascending myCol
    • myDF %>% arrange(myCol) # sort by ascending myCol (same as above)
    • myDF %>% pull(myCol) # extract myCol as a vector
    • myDF %>% pull(myCol) %>% mean() # extract myCol as a vector and take the mean()
    • myDF %>% select(a, b) %>% summary() # keep only columns “a” and “b”, then create a summary of that frame

Graphical Visualizations:

  • The ggplot2 package is useful for plotting data in R
    • “grammar of graphics” - layering approach to graphics
    • Build a base layer and then add to it
    • ggplot(data=myDF, aes(x=x, y=y)) + geom_boxplot() + theme_bw() # build a boxplot of y by x using the bw theme
    • Can include options like color= or fill=
    • Can customize axes using xlab(), ylab(), and ggtitle()

Example code includes:

# List the objects in global environment
# ls()

# Load the "abalone.RData" dataset
# load("abalone.RData")

# List objects in global environment again 
# ls()

# Learn more about the load function
# help(load)

# Learn more about the ls function
# help(ls)


# Run sessionInfo() to see packages available to this session
# sessionInfo()

# Load Hmisc package
# library(Hmisc)

# Run sessionInfo() again to see updated package list
# sessionInfo()


# Load abalone.csv dataset and assign output to abalone
# abalone <- readr::read_csv("abalone.csv")

data(abalone, package = "AppliedPredictiveModeling")
abalone <- tibble::as_tibble(abalone)

# Get the dimensions of the abalone dataset object
dim(abalone)

# Get variable names in the abalone dataset object
names(abalone)

# Lowercase the first letter of a name (e.g., "LongestShell" -> "longestShell")
chgName <- function(x) {
    paste0(stringr::str_to_lower(stringr::str_sub(x, 1, 1)), stringr::str_sub(x, 2))
}

tmpNames <- sapply(names(abalone), FUN=chgName, USE.NAMES=FALSE)
names(abalone) <- tmpNames
names(abalone)


# View top 3 rows of abalone dataset using head()
head(abalone, 3)

# View bottom 3 rows of abalone dataset using tail()
tail(abalone, 3)

# Run arrange function from dplyr to sort the data by rings
arrange(abalone, rings)

# Rewrite the line of code above using the %>% notation
abalone %>% 
    arrange(rings)


# Find mean length of abalones using pull() and mean() 
abalone %>% 
    pull(longestShell) %>% 
    mean()

# Find the median wholeWeight of the abalones
abalone %>% 
    pull(wholeWeight) %>% 
    median()

# Get descriptive statistics of diameter and shellWeight
abalone %>% 
    select(diameter, shellWeight) %>% 
    summary()


# Add a title and labels for the axes
ggplot(abalone, aes(shellWeight)) + 
    geom_histogram(color = "blue", fill = "yellow") +
    xlab("Shell Weight") + 
    ylab("Frequency Counts") + 
    ggtitle("Shell Weights Histogram")


# Change the boxplots to the violin geom
ggplot(data = abalone, aes(x=type, y=shuckedWeight)) + 
    geom_violin() + 
    theme_bw()


# Create panel plot of scatterplot for sex categories
ggplot(abalone, aes(diameter, wholeWeight)) + 
    geom_point() + 
    geom_smooth() + 
    facet_wrap(~type)

Chapter 2 - Data Wrangling